Stacking L2 cache, RIKEN simulates a 10X speedup for an A64FX successor in 2028


Let the era of 3D V-Cache in HPC begin.

Inspired by AMD's Epyc 7003 "Milan-X" processors with their stacked 3D V-Cache L3 cache, and spurred on by real-world benchmark tests pitting plain Milan processors against Milan-X processors using real and synthetic HPC applications, researchers at the RIKEN lab in Japan, home of the "Fugaku" supercomputer based on Fujitsu's impressive A64FX vectorized Arm server chip, ran a simulation of a hypothetical A64FX successor that could, in theory, be built in 2028 and deliver nearly an order of magnitude more performance than the current A64FX.

AMD let the world know it was working on 3D V-Cache for desktop and server processors in June 2021, and showed off a custom Ryzen 5900X chip with the L3 cache stacked atop the die, tripling its capacity with a single stack. (The stacked L3 is twice as dense because it has none of the control circuitry that the "real" L3 cache on the die has; the stacked cache, in effect, piggybacks on the cache pipes and controllers on the die.) Last fall, ahead of the SC21 supercomputing conference, AMD previewed some of the features and benchmarks of the Milan-X variants of the Epyc 7003 processor line, and the Milan-X chips launched in March of this year.

We believe 3D V-Cache will eventually be used on all CPUs, once the manufacturing techniques are perfected and cheap enough, precisely because it frees up die area that can be dedicated to more compute cores. (We discussed this, among other things, in an interview with Dan McNamara, senior vice president and general manager of the server business at AMD.)

The RIKEN researchers, in collaboration with colleagues from the Tokyo Institute of Technology and the National Institute of Advanced Industrial Science and Technology in Japan, the KTH Royal Institute of Technology and Chalmers University of Technology in Sweden, and the Indian Institute of Technology, got their hands on servers with AMD Epyc 7773X (Milan-X) and Epyc 7763 (Milan) processors, ran the MiniFE finite element proxy application, one of the important codes used by the Exascale Computing Project in the US to test the scaling of exascale-class HPC machines, on both machines, and showed that the extra L3 cache helped boost performance by a factor of 3.4X. Take a look:

In the HPC world, a 3.4X speedup is nothing to sneeze at. And that got the RIKEN team thinking: what if 3D V-Cache were added to the A64FX processor? More specifically, it got them wondering how they could marry 3D V-Cache to a hypothetical A64FX kicker, which would be expected in about six years based on the cadence between the "K" supercomputer using the Sparc64-VIIIfx processor from 2011 and the Fugaku supercomputer based on the A64FX from 2020. So they fired up the Gem5 chip simulator and modeled what a future A64FX kicker with a big stacked L2 cache (the A64FX has no shared L3 cache at all, but it does have a large shared L2 cache), which they dub LARC, as in LARge Cache, might look like and how it might perform on popular RIKEN codes and HPC benchmarks.

It is the power of creating digital twins, aptly illustrated. And if you read the paper published by RIKEN and its collaborators, you will see that the task is not as simple as the Nvidia and Meta Platforms advertisements make it seem, but it can be done well enough to get a first-order approximation that is good enough to justify funding future research and development.

We have done many deep dives on Fugaku system technology, but the biggest one is this piece from 2018. And this snapshot of the A64FX processor is a good place to start the discussion of the hypothetical A64FX CPU kicker:

The A64FX has four core memory groups, or CMGs, each of which has thirteen custom Arm cores created by Fujitsu, a chunk of shared L2 cache, and an HBM2 memory interface. One of the cores in each group is used to handle I/O, which leaves 48 user-addressable cores across the chip to do the calculations. Each core has a 64 KB L1 data cache and a 64 KB L1 instruction cache, and each CMG has an 8 MB segmented L2 cache, for 32 MB across the whole chip. The design has no L3 cache, and that is on purpose because L3 caches often cause more contention than they are worth; Fujitsu has always believed in the largest possible L2 caches in its Sparc64 CPU designs, and the A64FX carries that philosophy forward. The L2 cache has 900 GB/sec of bandwidth into the cores of each CMG, so the overall L2 cache bandwidth on the A64FX processor is 3.6 TB/sec. Each A64FX core can do 70.4 gigaflops of 64-bit double precision floating point math, or 845 gigaflops per CMG and 3.4 teraflops across the entire processor complex, which is etched in a 7 nanometer process by Taiwan Semiconductor Manufacturing Co.
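
To make those peak numbers concrete, here is a quick back-of-the-envelope sketch in Python that reproduces the per-core, per-CMG, and per-socket figures; the two 512-bit SVE pipes doing fused multiply-adds per cycle is how the A64FX gets to 70.4 gigaflops, a detail that is not spelled out above.

```python
# Back-of-the-envelope peak math for the A64FX, from published specs.
CLOCK_GHZ = 2.2              # boost clock behind the 70.4 gigaflops figure
SVE_PIPES = 2                # two 512-bit SVE pipelines per core
DP_LANES = 512 // 64         # eight 64-bit lanes per 512-bit vector
FLOPS_PER_FMA = 2            # a fused multiply-add counts as two flops

COMPUTE_CORES_PER_CMG = 12   # 13 cores per CMG, one reserved for I/O duty
CMGS_PER_SOCKET = 4
L2_BW_PER_CMG_GBS = 900      # GB/sec into the cores of each CMG

gflops_per_core = CLOCK_GHZ * SVE_PIPES * DP_LANES * FLOPS_PER_FMA
gflops_per_cmg = gflops_per_core * COMPUTE_CORES_PER_CMG
tflops_per_socket = gflops_per_cmg * CMGS_PER_SOCKET / 1000
l2_bw_per_socket_tbs = L2_BW_PER_CMG_GBS * CMGS_PER_SOCKET / 1000

print(f"per core  : {gflops_per_core:.1f} gigaflops")     # 70.4
print(f"per CMG   : {gflops_per_cmg:.1f} gigaflops")      # 844.8
print(f"per socket: {tflops_per_socket:.2f} teraflops")   # 3.38
print(f"L2 bandwidth per socket: {l2_bw_per_socket_tbs:.1f} TB/sec")  # 3.6
```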

Believe it or not, RIKEN did not have the floor plan for the A64FX processor and had to estimate it from die shots and other specs, but that estimate served as the starting point for the Gem5 simulator, which is open source and used by many technology companies. (We cannot imagine this estimate was made without at least unofficial Fujitsu review and approval, and we therefore believe the floor plan used for the A64FX is quite accurate.)

Once it had this A64FX floor plan, the RIKEN team assumed that a 1.5 nanometer process would be available for a shipping processor in 2028. What this dramatic shrinkage allows is that the CMG in the hypothetical LARC chip is a quarter of the size of the one used. in the A64FX, which means the LARC processor can have 16 CMGs. Assuming that L2 cache can be reduced at a similar rate and scale, the idea with the LARC chip is to have eight L2 cache arrays with through-chip interfaces (TCI) gluing them all together, from the same way as silicon vias (TSV). ) are used in HBM memory stacks. RIKEN thinks this stacked memory can run at around 1GHz, based on simulations and other stacked SRAM research, and as you can see from the comparative CMG diagrams below, he thinks he can create stackable SRAM L2 chips that cover the entire GCM:
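
As a rough sanity check on that geometry, here is the area and capacity arithmetic; note that the 48 MB-per-layer figure is our own inference, dividing the 384 MB per CMG quoted below across eight stacked dies, and is not a number pulled from the paper.

```python
# Rough geometry arithmetic for the hypothetical LARC chip.
# Assumption: the 1.5 nm CMG occupies a quarter of the A64FX CMG area,
# so the same compute-die budget holds four times as many CMGs.
A64FX_CMGS = 4
AREA_SHRINK = 4                       # LARC CMG at ~1/4 the A64FX CMG area
larc_cmgs = A64FX_CMGS * AREA_SHRINK  # 16 CMGs per socket

# Assumption: eight stacked SRAM dies per CMG; the per-layer capacity is
# inferred by spreading the quoted 384 MB per CMG across those eight layers.
L2_LAYERS = 8
L2_PER_CMG_MB = 384
l2_per_layer_mb = L2_PER_CMG_MB / L2_LAYERS       # 48 MB per stacked die
l2_per_socket_gb = L2_PER_CMG_MB * larc_cmgs / 1024

print(f"{larc_cmgs} CMGs, {l2_per_layer_mb:.0f} MB per SRAM layer, "
      f"{l2_per_socket_gb:.0f} GB of L2 per socket")  # 16 CMGs, 48 MB, 6 GB
```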

This is an important distinction, because AMD's 3D V-Cache only stacks L3 cache on top of L3 cache, not on top of compute. But we are talking about 2028 here, and presumably the materials people will have figured out how to address the heat dissipation issues that come with putting L2 cache above the CPU cores. Running at 1 GHz, the bandwidth of the L2 cache in the LARC CMG will be 1,536 GB/sec, a factor of 1.7X higher than on the A64FX, and the cache capacity will be 384 MB per CMG, or 6 GB across the socket. Additionally, RIKEN estimates that the CMG will have 32 cores, a factor of 2.7X more.
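
Here is the same sort of arithmetic for the per-CMG ratios RIKEN quotes, plus the aggregate L2 bandwidth across a full LARC socket, which is our own extrapolation from the per-CMG figure rather than something stated in the paper:

```python
# Per-CMG ratios between the A64FX and the hypothetical LARC design.
A64FX = {"l2_bw_gbs": 900, "l2_mb": 8, "compute_cores": 12, "cmgs": 4}
LARC = {"l2_bw_gbs": 1536, "l2_mb": 384, "compute_cores": 32, "cmgs": 16}

bw_ratio = LARC["l2_bw_gbs"] / A64FX["l2_bw_gbs"]            # ~1.7X
capacity_ratio = LARC["l2_mb"] / A64FX["l2_mb"]              # 48X per CMG
core_ratio = LARC["compute_cores"] / A64FX["compute_cores"]  # ~2.7X

# Our own extrapolation: aggregate L2 bandwidth across each socket.
a64fx_socket_tbs = A64FX["l2_bw_gbs"] * A64FX["cmgs"] / 1000  # 3.6 TB/sec
larc_socket_tbs = LARC["l2_bw_gbs"] * LARC["cmgs"] / 1000     # ~24.6 TB/sec

print(f"L2 bandwidth ratio per CMG: {bw_ratio:.1f}X")
print(f"L2 capacity ratio per CMG:  {capacity_ratio:.0f}X")
print(f"Core count ratio per CMG:   {core_ratio:.1f}X")
print(f"Socket L2 bandwidth: {a64fx_socket_tbs:.1f} vs {larc_socket_tbs:.1f} TB/sec")
```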

To isolate the effect of the stacked L2 cache on the LARC design, RIKEN kept the main memory at the HBM2 level rather than simulating what HBM4 might look like, and kept the same HBM2 capacity and bandwidth per CMG as on the A64FX, at 8 GB and 256 GB/sec. We wonder if CMGs with such high L2 cache bandwidth might end up constrained by HBM2 bandwidth, and whether the overall performance of a 512-core LARC socket might be inhibited by that relatively low HBM2 bandwidth and capacity (only 8 GB per CMG). Sticking with HBM2, with one controller per CMG and 16 CMGs, the LARC socket gets only 128 GB of HBM2 memory and 4 TB/sec of bandwidth. An improvement factor of 4X will probably be needed in a real LARC design, we presume, to keep everything in balance, assuming the clock speed on the cores and the uncore region does not change much.
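
To put that balance worry into numbers, here is a hedged sketch of the bytes-per-flop arithmetic; it assumes the LARC cores keep the A64FX's 70.4 gigaflops apiece, which is our assumption, not the paper's:

```python
# Bytes-per-flop balance check, assuming LARC keeps the same 70.4 gigaflops
# cores as the A64FX (our assumption; the paper does not fix the core design).
GF_PER_CORE = 70.4
HBM2_BW_PER_CMG_GBS = 256     # one HBM2 controller per CMG, as on the A64FX

def bytes_per_flop(cores_per_cmg: int, cmgs: int) -> float:
    """Main memory bytes available per double precision flop at peak."""
    peak_gflops = GF_PER_CORE * cores_per_cmg * cmgs
    hbm_gbs = HBM2_BW_PER_CMG_GBS * cmgs
    return hbm_gbs / peak_gflops

a64fx = bytes_per_flop(cores_per_cmg=12, cmgs=4)    # ~0.30 bytes/flop
larc = bytes_per_flop(cores_per_cmg=32, cmgs=16)    # ~0.11 bytes/flop

print(f"A64FX: {a64fx:.2f} bytes/flop, LARC: {larc:.2f} bytes/flop")
print(f"Flops-to-HBM balance shift: {a64fx / larc:.1f}X worse on LARC")
```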

To isolate the effect of core count differences, the RIKEN researchers created a 32-core CMG but left the cache at 8 MB per CMG and 800 GB/sec of bandwidth. And just for fun, they created a LARC design with 256 MB of L2 cache and the same 800 GB/sec of bandwidth (presumably with just four L2 stacks) to isolate the effect of capacity on HPC performance, and then went all in with 512 MB of L2 cache running at 1.5 TB/sec. By the way, the LARC cores are the same custom 2.2 GHz cores used in the A64FX, with no changes there, either. And you know full well that by 2028 Fujitsu, Arm, and RIKEN will have a radically better core, though it may not have better 512-bit vectors. (We shall see.)
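
To keep that sweep of configurations straight, here is how we read the paragraph above, laid out as data; the names are ours, and the exact matrix in the paper may differ:

```python
# The simulated LARC-side CMG configurations as we read the paragraph above;
# the names are ours, and all configs use the 2.2 GHz A64FX cores unchanged.
configs = [
    # (name,                  cores, L2 MB, L2 GB/s, what it isolates)
    ("cores only",               32,     8,     800, "effect of core count alone"),
    ("cores + capacity",         32,   256,     800, "effect of L2 capacity"),
    ("cores + capacity + bw",    32,   512,    1536, "full stacked-L2 design"),
]

for name, cores, l2_mb, l2_gbs, isolates in configs:
    print(f"{name:24s} {cores:2d} cores  {l2_mb:3d} MB L2 @ {l2_gbs:4d} GB/s  ({isolates})")
```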

Here is how the hypothetical LARC and actual A64FX chips stacked up against each other in the STREAM Triad memory bandwidth benchmark:

This test shows bandwidth as the number of OpenMP threads increases, and the one below shows how STREAM performance changes as the vector input data size scales from 2 KB to 1 GB.

The spike in the LARC lines comes about because with 2.7X the cores in a CMG you also have 2.7X the amount of L1 cache, so the smaller vectors all fit in L1, and then the larger vectors keep fitting in the much bigger L2 cache for a lot longer. So the simulated LARC chip keeps humming along until all that cache is exhausted.
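
A rough way to see that effect: the STREAM Triad kernel computes a[i] = b[i] + scalar * c[i], so its working set is three arrays, and which cache level that working set lands in determines the bandwidth you measure. Here is a sketch using the per-CMG figures from above, with the aggregate L1 capacity taken as simply core count times 64 KB:

```python
import numpy as np

def triad(b: np.ndarray, c: np.ndarray, scalar: float) -> np.ndarray:
    """The STREAM Triad kernel: a[i] = b[i] + scalar * c[i]."""
    return b + scalar * c

def cache_level(vector_bytes: int, cores: int, l2_mb: int) -> str:
    """Which per-CMG cache level a Triad working set (three arrays) fits in."""
    working_set = 3 * vector_bytes
    l1_total = cores * 64 * 1024          # 64 KB L1 data cache per core
    l2_total = l2_mb * 1024 * 1024
    if working_set <= l1_total:
        return "L1"
    if working_set <= l2_total:
        return "L2"
    return "HBM"

a = triad(np.ones(1024), np.ones(1024), 3.0)   # quick check of the kernel itself

# Per-CMG comparison: 12 compute cores / 8 MB L2 (A64FX) vs 32 cores / 384 MB L2 (LARC).
for size in (2**11, 2**19, 2**23, 2**30):      # 2 KB up to 1 GB vectors
    print(f"{size:>10d} B vectors: A64FX hits {cache_level(size, 12, 8):3s}, "
          f"LARC hits {cache_level(size, 32, 384)}")
```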

Here is the upshot of all this simulation. Across a large suite of HPC benchmarks, including real applications running at RIKEN and other supercomputing centers as well as a bunch of HPC benchmarks we are all familiar with, the LARC CMG was able to deliver roughly 2X more performance on average, and as much as 4.5X for some workloads. Add in the quadrupling of CMGs and you have a CPU socket that could be anywhere from 4.9X to 18.6X more powerful. The geometric mean of the performance improvements between the A64FX and LARC for L2 cache sensitive applications is 9.8X. Take a gander:
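
The 9.8X headline figure is a geometric mean of per-application speedups, the standard way to aggregate ratios. Here is a minimal sketch of that computation; the per-application numbers in the list are invented for illustration (only the 4.9X and 18.6X endpoints come from the text above), so the printed result is not the paper's 9.8X:

```python
import math

def geomean(xs):
    """Geometric mean, the standard way to aggregate speedup ratios."""
    return math.exp(sum(math.log(x) for x in xs) / len(xs))

# Hypothetical per-application socket-level speedups, purely illustrative;
# the real numbers live in the RIKEN paper.
speedups = [4.9, 7.2, 9.5, 12.4, 18.6]
print(f"geometric mean: {geomean(speedups):.1f}X")
```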

The Gem5 simulations were run at the CMG level because the Gem5 simulator could not handle a full LARC socket with sixteen CMGs, so RIKEN had to make assumptions about how performance would scale across the socket.
