AMD Finds Zen in Microarchitecture
New 14nm Zen Core Will Power Processors Across All Markets
By David Kanter
The Zen microarchitecture offers a fresh start for AMD’s computing ambitions. It’s the company’s first CPU in a FinFET node, and it offers 40% higher IPC and power efficiency than the prior generation. Zen will serve in notebooks, desktops, and servers and will enable far more-competitive x86 products in 2017.
For the last six years, AMD’s position in the processor world has eroded. The company has been stuck on planar 32nm and 28nm technologies, and the Bulldozer CPU core and its derivatives have failed to keep pace with Intel’s steady stream of architecture improvements and FinFET-based 22nm and 14nm processes. Although AMD’s power-management, graphics, and media blocks are competitive, its lack of a power-efficient high-performance core has prevented sales to anything but low-end client systems. In the server market, the company’s share has fallen from a high of 25% to trace levels. Yet customers remain eager for AMD to field competitive products that are a viable alternative to Intel.
The Zen core is a dramatically better microarchitecture than the previous generation. In an era when a 10% improvement is huge, AMD had to rethink most of the CPU core. It added two-way simultaneous multithreading and abandoned the conjoined-core approach of the Bulldozer family. Each core integrates a new micro-op cache to avoid x86 decoding overhead, as well as a redesigned L1 cache with higher-performance and lower-power writeback caching, a private FPU and L2 cache, and many smaller modifications.
The basic integer pipeline is 19 stages using the conventional instruction fetch, which is similar to that of most high-performance cores. Zen is AMD’s first CPU to employ 14nm FinFETs, which should reduce the voltage to provide roughly 30% more power efficiency compared with older 28nm HKMG technology.
Today’s mainstream AMD products are based on the Bulldozer family, which also includes the Piledriver, Steamroller, and, most recently, Excavator CPUs. As a major upgrade, Zen will trigger a refresh across all the company’s product lines. In 1Q17, it’s scheduled to find a home in a new eight-core desktop-PC processor, code-named Summit Ridge, that is compatible with the existing AM4 socket. Slated to follow this design in 2Q17 is a 32-core server processor, code-named Naples. Last will come new notebook-PC processors in 2H17. AMD will announce product-specific details as these devices emerge, starting later this year.
Grasping for Privacy Threads
One of the biggest and most pervasive changes in Zen is multithreading and core partitioning. The previous-generation Bulldozer shared parts of the pipeline, the floating-point and SIMD units, and the L2 cache between two cores (see MPR 8/30/10, “AMD Bulldozer Plows New Ground”). All those resources are now private to each core, as Figure 1 illustrates, but they are shared between two threads in a core, similar to Intel’s Hyper-Threading. By default, most resources are shared competitively, but a small number are statically partitioned.
Figure 1. Block diagram of AMD Zen microarchitecture. The Zen core includes a new micro-op cache and a writeback L1 data cache. It has two-way multithreading and much wider internal buffers with private FP and SIMD units as well as a private L2 cache.
Although the Zen front end is quite conventional for a high-performance x86 processor, it’s a big change for AMD. It is the company’s first processor with a micro-op cache, which first appeared in the Intel Sandy Bridge microarchitecture; this feature improves performance and saves power.
The instruction stream is tracked in 64-byte windows—the same granularity as branch prediction—although instruction fetches are 32 bytes wide. Branch prediction is dynamic, using three different mechanisms. For conditional branches, a perceptron-based predictor checks the two-level branch target buffer (BTB). AMD withheld the size and associativity of the L1 and L2 BTB, but similar to Jaguar, each line can hold two branches if the first one is predicted as not taken. Predictions from the L2 BTB will likely stall fetches for 1–3 cycles.
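Because AMD has not described how its perceptron predictor is organized, the following is only a sketch of the generic textbook perceptron scheme, not Zen's design. The table size, history length, indexing by PC modulo table size, and the class and method names are all hypothetical choices made for illustration.

```python
# Textbook perceptron branch predictor. This is NOT AMD's undisclosed design.
# HISTORY_LEN and NUM_PERCEPTRONS are hypothetical values for illustration.
HISTORY_LEN = 16
NUM_PERCEPTRONS = 256
# Training threshold commonly quoted in the perceptron-predictor literature.
THRESHOLD = int(1.93 * HISTORY_LEN + 14)


class PerceptronPredictor:
    def __init__(self):
        # One weight vector per table entry; index 0 is the bias weight.
        self.weights = [[0] * (HISTORY_LEN + 1) for _ in range(NUM_PERCEPTRONS)]
        # Global branch history: +1 for taken, -1 for not taken.
        self.history = [-1] * HISTORY_LEN

    def _output(self, pc):
        w = self.weights[pc % NUM_PERCEPTRONS]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        # A non-negative dot product means "predict taken".
        return self._output(pc) >= 0

    def update(self, pc, taken):
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction or when the output is not yet confident.
        # Real hardware would also saturate the weights; omitted for brevity.
        if (y >= 0) != taken or abs(y) <= THRESHOLD:
            w = self.weights[pc % NUM_PERCEPTRONS]
            w[0] += t
            for i, hi in enumerate(self.history, start=1):
                w[i] += t * hi
        # Shift the newest outcome into the global history.
        self.history = self.history[1:] + [t]
```

The appeal of this approach is that each weight measures how strongly one past branch outcome correlates with the current branch, so long histories cost only linear storage.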
Once Zen detects a dynamic indirect branch, it will move that branch to a special indirect target array (ITA) that is 512 entries and direct mapped. Tracking of call and return pairs is through a 32-entry return address stack, which is statically partitioned and can recover from incorrect speculation. Most other branch-prediction resources are shared, and the processor can algorithmically prioritize one thread in response to events. For example, when one thread recovers from a branch misprediction, the recovering thread receives priority.
Unlike previous generations, the instruction TLB is part of the branch-predictor block, which enables more-aggressive prefetching and provides a physical address earlier in the pipeline. The ITLB is a three-level structure that AMD starts numbering at zero. The L0 and L1 ITLBs are fully associative with 8 and 64 entries, respectively, and can cache 4KB, 2MB, and 1GB page mappings. The L2 ITLB is 512 entries and four-way set associative; it doesn’t map 1GB pages. Hits in the L1 and L2 ITLBs inject several delay cycles into the fetch pipeline. The prefetcher could, for example, work directly with physical addresses, but AMD declined to reveal any prefetching details.
Micro-op Cache Evades x86 Decode Tax
Once a physical address for the instruction pointer is determined, it goes into a request queue for the conventional instruction cache and is used to probe the microtags for the new micro-op cache. The microtags indicate whether a given address is present and, if so, which cache way is predicted to contain the desired micro-ops.
AMD withheld details of the micro-op-cache organization. A micro-op-cache hit can read out the entire line in one cycle, although most lines are only partially packed, so the cache will typically sustain a throughput lower than the theoretical limit. Given that the core was designed to sustain six micro-ops per clock and that the micro-op-cache lines are partially packed, the lines must be at least six micro-ops—probably more. A micro-op-cache hit also reduces the instruction-pipeline length by two stages and cancels the L1 instruction-cache request, saving power.
The micro-op cache is filled by the conventional instruction-fetch-and-decode pipeline, but it’s neither inclusive nor exclusive of the L1 instruction cache. As a result, handling self-modifying code is more difficult, as the core must check and potentially invalidate both caches. Since the TLBs are earlier in the pipeline, the micro-op cache may be physically addressed, unlike Intel’s virtually addressed micro-op cache.
Instruction fetches are directed to a physically addressed 64KB instruction cache that is four-way set associative and comprises 64-byte lines. On a hit, the cache fills 32 bytes into the instruction byte buffer, which decouples the fetch and decode. AMD kept the size of the instruction byte buffer under wraps, but Bulldozer had 16 entries; Zen is likely to use a larger structure. Instruction-cache misses will initiate an L2 request and are filled by way of a 32-byte bus.
The Zen decoder reads up to four x86 instructions and can emit up to eight micro-ops, but it typically emits four micro-ops, since most instructions map to one micro-op. The decoder includes a dedicated stack engine that can eliminate PUSH and POP instructions, along with a memory file to track the actual data dependencies. Apparently, AMD removed the stack engine from the Bulldozer and Excavator cores—one of many unfortunate architectural mistakes. The decoders can fuse compare and jump for branches as well.
After decoding, micro-ops go into a 72-entry micro-op queue, which is where the conventional fetch-and-decode path and micro-op cache converge. The micro-op queue is statically partitioned between threads.
Integers Gone Wide
After all decoding is finished, micro-ops are dispatched to the back end for renaming, scheduling, and execution. As with many AMD processors, the back end is logically and physically split into two halves: integer and memory in one half and floating-point and SIMD functions in the other.
The dispatcher can send up to six micro-ops to the integer side, which is much wider than the four-wide Bulldozer family. Zen tracks all micro-ops through a 192-entry retirement queue and physical register files. Register moves are resolved by changing the register mapping rather than by executing a micro-op, although AMD declined to indicate whether the renamer can zero out registers through remapping. The core state is checkpointed, enabling faster recovery from branch mispredictions and other pipeline flushes. The retirement queue is statically partitioned and can retire up to eight micro-ops per cycle from a single thread.
Integer operations must also allocate one of 168 physical registers and a scheduler entry. The registers are competitively shared with algorithmic priority. Zen has four integer schedulers, each with 14 entries. Each scheduler fans out to an ALU, which can execute basic integer operations including shifts. Two-operand LEA (load effective address) instructions decode into ADD micro-ops and can use any ALU pipeline, while three-operand versions employ the address-generation unit (AGU) instead. In addition, ALUs 0 and 3 execute branches, ALU 1 performs multiplication, and ALU 2 handles division.
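The port assignments just described can be written down as a small lookup table. This is only a bookkeeping sketch: the dictionary layout, operation labels, and function names are illustrative and do not correspond to any AMD-defined interface.

```python
# Integer issue ports as described in the text. Names are illustrative only.
ALU_CAPABILITIES = {
    0: {"alu", "branch"},
    1: {"alu", "multiply"},
    2: {"alu", "divide"},
    3: {"alu", "branch"},
}

# Four integer schedulers with 14 entries each.
INT_SCHEDULER_ENTRIES = 4 * 14


def eligible_alus(op_kind):
    """Return the ALU numbers that can execute the given kind of micro-op."""
    return sorted(p for p, caps in ALU_CAPABILITIES.items() if op_kind in caps)


def lea_unit(num_operands):
    """Two-operand LEA becomes an ADD on any ALU; three-operand LEA uses an AGU."""
    return "alu" if num_operands == 2 else "agu"
```

For example, `eligible_alus("branch")` returns `[0, 3]`, so a burst of multiplies or divides queues behind a single pipeline while simple operations can spread across all four.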
Back to Writeback Caching
Memory micro-ops are more complex, requiring registers as well as entries in both a scheduler and the load or store queue. Accesses are tracked in a 72-entry load queue and a 44-entry store queue until they become globally observable. Unlike most other resources, the store buffer is statically partitioned between threads. Memory requests can enter either of the two 14-entry schedulers. The memory pipelines are 128 bits wide, so a 256-bit access (e.g., AVX) takes two slots in the register file, memory queue, and schedulers. Load requests can be reordered around other loads and stores with known addresses and can be speculatively moved ahead of stores with unknown addresses. Gather instructions are microcoded, similar to the previous-generation design, Excavator.
Each scheduler issues memory-address micro-ops to an address-generation unit, which calculates the virtual address. The virtual address is translated in the DTLB, which has two levels. The L1 DTLB, which contains 64 fully associative entries, caches all page sizes. The L2 DTLB, which caches 4KB and 2MB mappings, holds 1.5K entries and is six-way set associative. Zen includes two hardware page-table walkers to service all TLB misses (instruction and data for both threads).
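To put these DTLB sizes in perspective, the sketch below computes TLB reach (entries multiplied by page size) and the number of sets in the L2 DTLB. It assumes "1.5K entries" means exactly 1,536; the function names are illustrative.

```python
KB, MB, GB = 1 << 10, 1 << 20, 1 << 30

L1_DTLB_ENTRIES = 64
L2_DTLB_ENTRIES = 1536  # assumption: "1.5K entries" means 1,536
L2_DTLB_WAYS = 6


def tlb_reach(entries, page_bytes):
    """Amount of memory mapped when every entry holds a page of this size."""
    return entries * page_bytes


def num_sets(entries, ways):
    """Number of sets in a set-associative structure."""
    assert entries % ways == 0
    return entries // ways


l1_reach_4k = tlb_reach(L1_DTLB_ENTRIES, 4 * KB)   # 256KB
l2_reach_4k = tlb_reach(L2_DTLB_ENTRIES, 4 * KB)   # 6MB
l2_reach_2m = tlb_reach(L2_DTLB_ENTRIES, 2 * MB)   # 3GB
l2_sets = num_sets(L2_DTLB_ENTRIES, L2_DTLB_WAYS)  # 256
```

Six ways times 256 sets gives the non-power-of-two capacity while keeping the set index a simple eight-bit field.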
The data-cache tags are probed in parallel with the TLBs, which is possible because the 32KB cache is eight-way set associative: each way is 4KB, the size of the smallest page, so the set index comes entirely from untranslated page-offset bits. The load-to-use latency for integer requests is four cycles. The cache can service two 16-byte requests per clock. Misaligned loads have a single-cycle penalty, assuming both accesses hit in the cache. The data cache has a writeback allocate-on-write policy, with data protected by ECC. Zen’s writeback caching is a big improvement over Bulldozer’s write-through caching, since write-through caching generally reduces performance and increases power.
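The link between associativity and parallel tag lookup is simple arithmetic, sketched below. The check asks whether the set-index and line-offset bits fit inside the 4KB page offset, which is identical in virtual and physical addresses. A 64-byte line size is assumed for the data cache, and the function names are illustrative.

```python
def index_and_offset_bits(cache_bytes, ways, line_bytes):
    """Address bits needed to select a set and a byte within the line."""
    sets = cache_bytes // (ways * line_bytes)
    return (sets.bit_length() - 1) + (line_bytes.bit_length() - 1)


def can_index_before_translation(cache_bytes, ways, line_bytes, page_bytes=4096):
    """True if the set index uses only untranslated page-offset bits."""
    page_offset_bits = page_bytes.bit_length() - 1
    return index_and_offset_bits(cache_bytes, ways, line_bytes) <= page_offset_bits


# 32KB, eight-way data cache with assumed 64-byte lines: 6 + 6 = 12 bits.
l1d_ok = can_index_before_translation(32 * 1024, 8, 64)
# 64KB, four-way instruction cache with 64-byte lines: 8 + 6 = 14 bits.
l1i_ok = can_index_before_translation(64 * 1024, 4, 64)
```

Under these assumptions the data cache passes the check and the instruction cache does not, which fits with the instruction side obtaining a physical address before the cache access.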
Loads that miss the data cache will occupy a miss-address buffer while checking the L2 cache and beyond. AMD withheld the maximum number of outstanding misses, but we expect it’s 8 to 14.
Stores require an address calculation and also generate a store-data micro-op, which Zen handles as a register read to an ALU pipeline that can write 16 bytes per clock to the store buffer before draining into the cache. The data cache maintains coherence using the MOESI protocol. Stores to uncacheable memory instead write to one of eight write-combining buffers. Zen includes CLZERO, an AMD-unique instruction that zeroes out an entire 64-byte cache line using a single store operation. Although the data cache can service two 16-byte reads and one 16-byte write per clock, the AGUs can generate only two addresses per clock, limiting the sustainable bandwidth.
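The gap between peak and sustainable data-cache bandwidth follows directly from these figures. The sketch below assumes each 16-byte access needs one address from an AGU; the function and parameter names are illustrative.

```python
def l1d_bytes_per_clock(reads=2, writes=1, access_bytes=16, agus=2):
    """Return (peak, sustained) data-cache bandwidth in bytes per clock."""
    # Peak: every read and write port is busy in the same cycle.
    peak = (reads + writes) * access_bytes
    # Sustained: assumes one AGU-generated address per 16-byte access.
    sustained = min(reads + writes, agus) * access_bytes
    return peak, sustained


peak, sustained = l1d_bytes_per_clock()  # (48, 32)
```

With two AGUs, the core can sustain 32 bytes per clock in any mix of loads and stores, even though the cache ports could move 48.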
Seeking Solace With Private Caches
Another big change in the memory hierarchy is a private L2 cache, compared with the high-latency shared L2 cache in the Bulldozer family. The 512KB L2 cache is eight-way set associative and is inclusive of the L1 data, L1 instruction, and micro-op caches, dramatically reducing snoop traffic to the other per-core caches. The L2 services requests from the L1 caches using a 32-byte bus to each one. The data and L2 caches include prefetchers, but AMD disclosed no further details.
A group of four cores and a shared L3 cache form a CPU complex. The L3 cache for a CPU complex is a mostly exclusive victim cache for the private L2 caches and is implemented as four slices. This 8MB cache is 16-way set associative, and data is spread across it using address interleaving on low-order bits rather than using a hash function. This approach should improve locality at the cost of creating address conflicts. The L3 cache can send 32 bytes per clock to each L2 cache in the CPU complex. Larger chips will have multiple complexes, which will communicate through a coherent fabric that’s 32 bytes wide. AMD withheld additional details about the L3 cache and overall SoC architecture.
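The effect of low-order interleaving can be illustrated with a short sketch. AMD has not said which address bits select the slice, so the code assumes 64-byte lines and uses the two bits just above the line offset; both are assumptions, as are the names.

```python
LINE_BYTES = 64      # assumption: 64-byte lines, as in the L1 caches
L3_BYTES = 8 << 20   # 8MB per CPU complex
L3_WAYS = 16
L3_SLICES = 4


def l3_slice(phys_addr):
    """Pick a slice from the low-order line-address bits (assumed mapping)."""
    return (phys_addr // LINE_BYTES) % L3_SLICES


def sets_per_slice():
    """Number of sets in each 2MB slice."""
    return L3_BYTES // L3_SLICES // (L3_WAYS * LINE_BYTES)


# Consecutive lines rotate through all four slices.
sequential = [l3_slice(a) for a in range(0, 4 * LINE_BYTES, LINE_BYTES)]
# A 256-byte stride lands on the same slice every time.
strided = {l3_slice(a) for a in range(0, 64 * 256, 256)}
```

Under this assumed mapping, sequential accesses spread evenly, whereas a power-of-two stride can concentrate traffic on one slice; a hash function would scatter such patterns.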
Modest Floating-Point and SIMD Ambitions
As with most AMD processors, floating-point (FP) and SIMD execution is separate from integer and memory operations. Although Zen supports up to AVX2, the entire FP pipeline (and memory hierarchy) is optimized for 128-bit operations; 256-bit instructions require twice the resources (e.g., registers and scheduler entries). Zen can dispatch up to four micro-ops to the FP half of the back end, which includes its own renaming and scheduling resources.
FP and SIMD micro-ops are tracked in the 192-entry retirement queue but are renamed onto a bank of 160x128-bit physical registers (which are shared between threads with prioritization). Operations also allocate from two schedulers, which can hold a total of 96 micro-ops. The first scheduling queue (SQ) can send operations to four FP execution units. The second scheduling queue (NSQ) cannot issue micro-ops; it simply holds FP micro-ops until SQ entries are free. The NSQ ensures that operations containing an FP micro-op (e.g., register-memory computation and FP stores) can be scheduled on the integer side, even if the 96-entry scheduler is full. It prevents the FP back end from causing dispatch stalls in the processor. SIMD micro-ops use the same scheduler and execution units as FP micro-ops. The individual scheduler sizes are undisclosed.
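The cost of 256-bit operations on a 128-bit datapath can be made concrete with a little arithmetic. The sketch below is an upper bound only: it ignores registers that hold architectural state, and the function names are illustrative.

```python
import math

DATAPATH_BITS = 128
FP_PHYS_REGS = 160  # 128-bit physical registers, shared between threads


def slots_needed(op_bits):
    """Registers or scheduler entries consumed by one operand of this width."""
    return math.ceil(op_bits / DATAPATH_BITS)


def max_renamed_values(op_bits):
    """Upper bound on values of this width that the register file can hold."""
    return FP_PHYS_REGS // slots_needed(op_bits)


sse_values = max_renamed_values(128)  # 160
avx_values = max_renamed_values(256)  # 80
```

AVX2 code therefore sees an effective register file and scheduler half as large as 128-bit code does.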
Loads take an extra three cycles to send data to the FP/SIMD side, but AMD declined to reveal the delay for other communication between the two halves (e.g., FP conversion, branch on FP, and result forwarding). The penalty is likely 1–3 cycles, however.
The Zen FP execution units are designed to take advantage of fused-multiply-add (FMA) instructions. Zen has two FMA pipelines that also handle multiplication, as well as two add pipelines. These pipelines can simultaneously execute two 128-bit multiplies and two 128-bit adds. But an FMA operation steals an extra read port for the third operand, blocking the add pipeline. One benefit of the separate add pipeline is that the latency for addition is three cycles compared with four in the Intel Skylake architecture. The drawback is that the FMA latency is probably worse than Skylake’s four clock cycles—we estimate five or six cycles, whereas AMD and Intel both use four cycles to complete a multiply.
Two of the FP units, most likely the multipliers, also include hardware for executing AES instructions. We expect one of the add pipelines contains a shuffle and permute unit so it can function without blocking an FMA. A single pipeline handles the register-read portion of stores, which can send 128 bits to the memory pipelines.
Performance Approaches Haswell’s
AMD has withheld Zen’s physical-design details, such as frequency. But we believe the FO4 depth of each stage is similar to Excavator’s, so it should achieve similar clock speeds in the same process. The 14nm FinFET process, however, provides a combination of higher frequency and greater efficiency. We expect the Zen core will ship at roughly 3.2GHz; the peak frequency (i.e., accounting for dynamic voltage and frequency scaling) is difficult to estimate.
AMD has disclosed little useful performance data, but it provided guidance that Zen’s IPC is 40% better on SPECint_rate2006 than Excavator’s, which is in turn 15% better than Steamroller’s. Unfortunately, reliable SPEC CPU scores for AMD processors are nearly impossible to find, as previous products were highly uncompetitive and the company refused to submit results. In fact, Intel generated the best reported SPECint_rate score for a recent AMD processor: 90.3 (base) for a quad-core A10-7850K—a 3.7GHz (4.0GHz turbo) chip with four Steamroller cores—running Windows using ICC 14.0.
Figure 2 shows our estimates for Excavator and Zen. First, we recalculated the A10-7850K’s benchmark score without libquantum, which ICC has cracked, resulting in an adjusted score of 81.4. Increasing that number by 15% for an Excavator-based design should yield 93.6, and a 40% boost from moving to Zen yields 131. We further expect that using a compiler optimized for Zen (instead of Intel’s compiler) would boost performance by another 10% to 144.
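The chain of estimates above is a simple compounding of percentage gains, reproduced here as a sketch. The starting score and the three gains are the figures quoted in the text; the function name is illustrative.

```python
def project(base, *gains):
    """Apply successive fractional gains and return each intermediate score."""
    score = base
    steps = []
    for gain in gains:
        score *= 1 + gain
        steps.append(score)
    return steps


# Adjusted A10-7850K score, then Excavator (+15%), Zen (+40%), compiler (+10%).
excavator, zen, zen_tuned = project(81.4, 0.15, 0.40, 0.10)
```

Rounded, the three results are 93.6, 131, and 144, matching the figures in the text. Because the gains multiply, an error in any one assumption carries through to the final estimate.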
Figure 2. Comparison of estimated performance for recent x86 desktop processor cores. The quad-core Zen should perform similarly to Intel’s Ivy Bridge CPU, assuming it can reach 3.7GHz. (Source: The Linley Group estimates based on data from SPEC.org)
Figure 2 also shows recent Intel desktop processors (e.g., Core i7-2600K, 3770K, 4770K, 6700K) and their adjusted SPEC scores (i.e., without libquantum). Overall, our estimate for Zen is fairly close to the score for Intel’s Ivy Bridge but short of Haswell’s and Skylake’s. AMD argues that the Intel products receive an unfair 5–10% boost because the company compiles to 32-bit x86 code, which is unrealistic for many applications. Adjusting by 10% would put Zen about halfway between Ivy Bridge and Haswell for SPECint_rate2006. Either way, Zen delivers a tremendous improvement in per-core performance.
Good for Servers But Not HPC
The exact range for Zen-based PC processors will depend on the frequency. Intel’s PC processors clock the CPU cores at 2–4GHz. Because Zen has a slight IPC deficit, it will need a higher frequency, all other things being equal, to match a Skylake core. The CPU is just one of several components in a PC processor, however—it’s clearly the most important, but graphics, media, image processing, and display interfaces are all critical for a well-rounded product. Any of these factors (along with lower prices) can sway OEM and end-user buying decisions.
To maximize multicore throughput, most Intel server processors operate at 2–3GHz, which is quite feasible for Zen. But servers require more than just low-power cores (e.g., 22 cores in 135W): they need a high-bandwidth low-latency L3 cache and fabric, coherent links, memory and I/O controllers, power management, and excellent overall integration. The new Zen core is a necessary but insufficient condition for server success. AMD has experience with many of these components and certainly understands what is necessary, but its disappearance from the mainstream server market means it must do more work to refresh its server products. The company is declining to discuss these other platform factors.
The Zen core does have some limitations that make it less suitable for scientific computing, which accounts for 15–20% of the server market. It sacrifices floating-point and SIMD throughput to reduce area and power—important metrics for this segment. As Table 1 illustrates, Zen offers more FP flexibility than Sandy Bridge and will deliver much better performance on SSE code. Haswell and Skylake, however, provide twice the flops per clock using AVX FMA instructions and, more importantly, twice the cache bandwidth to feed the FP and SIMD execution units.
Table 1. Comparison of floating-point units and cache hierarchy. In FP and cache performance, Zen is similar to Sandy Bridge but falls short of Haswell and the Skylake client core. *or 2x 128-bit FMUL + 2x 128-bit FADD. (Source: vendors)
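The factor-of-two gap in peak throughput can be checked with a short calculation of single-precision flops per clock. The Zen pipeline counts come from the text; the Intel configurations (one 256-bit multiply plus one 256-bit add pipe for Sandy Bridge, two 256-bit FMA pipes for Haswell) are the commonly documented ones and are stated here as assumptions, since the table itself is not reproduced. The function name is illustrative.

```python
def sp_flops_per_clock(pipes, vector_bits, fused):
    """Peak single-precision flops per clock for a set of identical pipes."""
    lanes = vector_bits // 32          # 32-bit elements per vector
    return pipes * lanes * (2 if fused else 1)


zen_fma = sp_flops_per_clock(2, 128, fused=True)          # two 128-bit FMAs
zen_mul_add = sp_flops_per_clock(4, 128, fused=False)     # 2 FMUL + 2 FADD
sandy_bridge = sp_flops_per_clock(2, 256, fused=False)    # assumed: FMUL + FADD
haswell = sp_flops_per_clock(2, 256, fused=True)          # assumed: two FMAs
```

Under these assumptions Zen and Sandy Bridge both peak at 16 flops per clock and Haswell at 32. Zen's advantage over Sandy Bridge is that it reaches its peak with 128-bit SSE code as well as with AVX.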
The forthcoming server version of Skylake will further double the computational throughput using AVX512 and also increase the cache bandwidth. Practically, Zen will therefore struggle to offer competitive HPC performance both for classic scientific computing and for workloads such as machine learning that require dense computations. Fortunately for AMD, the bulk of the server market is running databases, web servers, and other tasks that fall outside scientific computing.
AMD Reboots With Zen
Almost every company has produced one or two subpar architectures: Intel had the P4 and Itanium, IBM had the Power6 and Cell, ARM had the Cortex-A8, Sun had the UltraSparc V, and Nvidia had the NV30. After winning plenty of battles with the K8 microarchitecture, AMD’s Waterloo was the Bulldozer line, which all but ended the company’s presence in the lucrative server market as well as in midrange client systems. After five difficult years, the Zen core is slated to reset the competitive landscape and reinvigorate AMD’s product line.
On the basis of our estimates, the 14nm Zen core should offer performance somewhere between that of Intel’s Ivy Bridge and Haswell generations on integer workloads. Although Zen-based processors cannot rival the latest Skylake core in high-end clients, AMD’s eight-core Summit Ridge chip should be a credible contender for midrange desktops. The company will thus have a shot at PC designs that were previously out of reach, expanding its market share and increasing average selling prices. Future Zen-based notebook processors should be similarly compelling, although they won’t arrive until late 2017.
In servers, Zen could enable midrange and low-end designs with the right complementary components (e.g., L3 cache, memory, and PCIe controllers) in 2017. Most of the server market today comprises two-socket designs, and AMD has demonstrated a two-socket server employing the 32-core Naples processor. Given that Naples will arrive shortly after Summit Ridge, we suspect both products will use the same die, but the server product will use a multichip package, an approach AMD uses in its current Opteron products. The inherent latencies of a multichip approach would require customers to tightly control the locality of their workloads. That would make Naples a good fit for mega-data-center customers such as Amazon, Baidu, and Google.
Naples includes more cores than we expect from Intel’s 28-core Skylake-EP, but we think Intel will still have better performance and power efficiency. If AMD can come within 20% of Intel, however, customers will happily buy quite a few chips and reopen the server market. Given that AMD’s market share in servers is nearly nonexistent, even a few design wins at large data center customers could make a big difference—particularly in AMD’s revenue.
PC and server vendors have had just one choice for high performance and power efficiency over the last few years, leaving them eagerly awaiting an alternative. Zen gives AMD a crucial component to building attractive processors, but now the company must deliver complete products, win designs, and ramp manufacturing—which should all start happening in the next year.
Price and Availability
AMD Zen-based processors are expected to ship in 1Q17; prices are not yet available. For more information, access www.amd.com/en-us/innovations/software-technologies/zen-cpu.