Microprocessor Report (MPR)

GlobalFoundries Wins AI Chips

12LP+ Process Improves Transistors, IP to Accelerate Neural Networks

July 27, 2020

By Linley Gwennap


Chip giants like Nvidia can afford 7nm technology, but startups and other smaller companies struggle with the difficult design rules and high tapeout cost, all for modest gains in transistor speed and cost. GlobalFoundries offers an alternative route with its new 12LP+ technology, which reduces power by cutting voltage rather than transistor size. It has also developed new SRAM and multiply-accumulate (MAC) circuits optimized specifically for AI acceleration. The result: typical AI operations consume up to 75% less power. Customers such as Groq and Tenstorrent are already achieving industry-leading results using the initial 12LP technology, and the first products manufactured in 12LP+ will tape out later this year.

To achieve these results, GlobalFoundries (GF) took a holistic approach to AI acceleration, specifically inferencing convolutional neural networks (CNNs). This workload relies heavily on MAC operations, but the company discovered that most of the power actually goes to reading data from local SRAM and transferring it to the MAC units. The new SRAM design greatly decreases power for CNNs and other applications that commonly access long data vectors. The new MAC design targets the smaller data types and lower clock speeds of most AI accelerators, also conserving power. The voltage reduction comes from redesigning the paired transistors in the SRAM cell to improve matching, thereby shrinking the required voltage margin.

GF took this path after dropping its plan for 7nm and beyond to instead focus on FD-SOI, SiGe, and other differentiated technologies (see MPR 8/13/18, “GlobalFoundries’ New Strategy”). The 12LP+ process and the AI-specific IP are further examples of this differentiation strategy. The advantages of this approach are in some ways greater than those of 7nm but come at a lower cost. Previously, the foundry had focused on building AMD’s high-performance CPUs, but as AMD has moved them to TSMC, the revised strategy has helped GF attract new customers.

Designed for AI

In a typical high-performance CPU, the local SRAM provides a full cache line every cycle, then the CPU selects the desired word through a multiplexer (mux). For example, a 64-bit CPU that uses a 256-bit cache line requires a 4:1 mux, as Figure 1(a) shows. In this case, all 256 bit lines in the SRAM array discharge on every access, even though the CPU employs only 64 bits per cycle. This approach minimizes the SRAM latency, potentially increasing the maximum clock speed or reducing the number of pipeline stages—both critical factors for CPU performance.

Figure 1. GlobalFoundries AI-specific memory. A general-purpose array minimizes latency for random accesses. Adding a latch increases latency but reduces power for sequential accesses.

AI accelerators typically operate at lower clock speeds than PC processors, and their designers care more about throughput than latency. Furthermore, whereas CPUs often have random access patterns, CNNs generate sequential memory accesses as they operate on vectors that often have hundreds or thousands of elements. To better support these designs, GlobalFoundries added a latch between the SRAM array and the mux, as Figure 1(b) shows. Doing so adds a cycle to the read path, which a CPU designer would never accept, but it provides considerable benefits for AI accelerators.

First, the latch decouples the mux from the array, reducing the capacitance on the bit lines and thus reducing power on each SRAM access. But the bigger benefit is that the full 256-bit output remains available in the latch after the read operation. If the following read operation accesses the next incremental memory address, that value can be read from the latch without driving the array at all. For a program that reads from a long series of sequential addresses, this design powers the SRAM array only 25% of the time. Considering the entire circuit, including the mux and the latch, GF estimates a 53% power reduction for CNN workloads relative to a standard compiled SRAM. Because of relaxed timing constraints, the new SRAM is also 25% smaller.
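The arithmetic behind that 25% figure is easy to model. The following Python sketch is ours, not GF’s simulation: it simply counts array activations for the two organizations in Figure 1, with the 4:1 line-to-word ratio as the only parameter taken from the article.

```python
# Illustrative model of the two SRAM organizations in Figure 1.
# Counts how often the 256-bit array must be driven; not GF data.

LINE_BITS = 256                 # bits discharged per array access
WORD_BITS = 64                  # bits consumed per cycle (4:1 mux)
WORDS_PER_LINE = LINE_BITS // WORD_BITS

def array_activations(addresses, latched):
    """Return the number of reads that actually drive the array."""
    activations = 0
    held_line = None            # line currently captured in the latch
    for addr in addresses:
        line = addr // WORDS_PER_LINE
        if not latched or line != held_line:
            activations += 1    # full 256-bit line discharges
            held_line = line    # Figure 1(b): latch captures the line
        # else: the word comes from the latch; the array stays idle
    return activations

stream = list(range(1024))      # CNN-style sequential vector reads
print(array_activations(stream, latched=False))   # 1024: every read
print(array_activations(stream, latched=True))    # 256: 25% of reads
```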

Although the MAC units contribute a smaller portion of the total power, they often constitute the largest portion of the total die area. The new design has a 16x16-bit multiplier, differing from the 64-bit designs that high-end CPUs require. The radix-4 Booth multiplier feeds into a 48-bit adder for high-precision accumulation. For the 8-bit integer (INT8) data common in CNN inferencing, the MAC unit can be split to produce two 8x8 multiplies per cycle with 24-bit accumulation. Targeting 1.0GHz operation allowed GF to simplify the physical design, cutting both power and die area. The new MAC unit is 12% smaller than the previous 12LP one and requires 25% less power at the same voltage when both operate at 1.0GHz.
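A functional sketch helps make the split mode concrete. The model below captures only the arithmetic widths (one 16x16 multiply into a 48-bit accumulator, or two 8x8 INT8 multiplies into 24-bit accumulators); the interface is hypothetical, and it says nothing about the radix-4 Booth circuit itself.

```python
# Functional model of the 12LP+ MAC unit's two modes (hypothetical
# interface); models arithmetic widths only, not the Booth multiplier.

MASK48 = (1 << 48) - 1
MASK24 = (1 << 24) - 1

def mac16(acc48, a16, b16):
    """Full mode: one 16x16 multiply accumulated into 48 bits."""
    return (acc48 + a16 * b16) & MASK48

def mac8x2(acc_lo, acc_hi, a_lo, a_hi, b_lo, b_hi):
    """Split mode: two independent 8x8 MACs with 24-bit accumulation."""
    return ((acc_lo + a_lo * b_lo) & MASK24,
            (acc_hi + a_hi * b_hi) & MASK24)

# Why 24 bits suffices for INT8 inferencing: the largest INT8 product
# is 128 * 128 = 16,384, so roughly 512 worst-case products fit in a
# signed 24-bit accumulator -- enough for typical CNN dot-product tiles.
lanes = [((100, 27), (45, 90)), ((3, 12), (7, 7))]
acc = (0, 0)
for (a_lo, b_lo), (a_hi, b_hi) in lanes:
    acc = mac8x2(*acc, a_lo, a_hi, b_lo, b_hi)
print(acc)   # (100*27 + 3*12, 45*90 + 7*7) = (2736, 4099)
```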

Lots of Work to Reduce Voltage

To further reduce power, GF focused on operating voltage. For any node, an important challenge is to manage manufacturing variation in the transistors. Minor differences in the shape, thickness, or doping of the gate and channel alter the transistor’s work function, a measure of the energy required to move an electron through the material. The work function modifies the threshold voltage, which determines when a transistor switches state. For a given process, the foundry sets the operating voltage high enough to ensure that all transistors on the die reliably switch, meaning it must exceed the worst-case threshold voltage.

To address this challenge, 12LP+ adds dual-work-function transistors. The company backported this technology, originally developed for its 7nm process, into the 12nm node. The new design dopes the NMOS and PMOS transistors differently to better balance the work functions between them. This approach, which has a small cost adder, greatly reduces the required margin, bringing the SRAM operating voltage from 0.7V in 12LP down to 0.55V in 12LP+ for a 1.0GHz target frequency. The 12LP logic has a nominal voltage of 0.8V and an underdrive voltage of 0.7V, but in 12LP+, it can also operate at 0.55V. Because power follows the square of the voltage, these changes can halve the power consumption.
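The “halve the power” claim follows from the standard dynamic-power relation P = C·V²·f: at a fixed frequency and capacitance, the savings is simply the ratio of the squared voltages. A quick check against the figures quoted above:

```python
# Dynamic power scales as P = C * V^2 * f, so at fixed frequency the
# savings from a voltage drop is just 1 - (V_new / V_old)^2.
V_NEW = 0.55
for label, v_old in [("12LP SRAM at 0.70V", 0.70),
                     ("12LP logic at 0.80V nominal", 0.80)]:
    saving = 1 - (V_NEW / v_old) ** 2
    print(f"{label} -> 0.55V: {saving:.0%} dynamic-power reduction")
# 0.70V -> 0.55V saves ~38%; 0.80V -> 0.55V saves ~53%, which is
# where "can halve the power consumption" comes from.
```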

SRAM is the main power consumer, so GF focused on developing a low-voltage memory cell. Test chips show the new LVSRAM has yields greater than 95% even at 0.45V, meaning the design has plenty of margin at 0.55V. To benefit logic functions, the foundry commissioned Arm’s physical-intellectual-property (physical-IP) group to create a complete library of low-voltage standard cells for the 12LP+ process. This library, slated for September availability, allows customers to construct a complete AI accelerator that operates the SRAM and MAC units at 0.55V.

The total power savings of the new technologies is dramatic. GF simulated the power for a systolic array of MAC units—a common arrangement for CNN acceleration. The simulation reads the weights and activations (shown in Figure 2 as SRAM power), moves the data through the systolic array (transfer), and performs the computation (MAC). Relative to the base design, the new MAC unit and latched SRAM reduce total energy by more than a third, although the transfer energy remains the same. Operating at 0.55V produces a big drop across the board, bringing the total savings to 68% for this design.
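GF hasn’t published the exact energy breakdown behind Figure 2, but a back-of-envelope model using only the component savings quoted earlier lands close to the stated 68%. The 55/25/20 baseline split among SRAM, transfer, and MAC energy below is our assumption (the article says only that SRAM reads dominate), and applying a single (0.55/0.80)² factor to all three components is a simplification.

```python
# Back-of-envelope check of Figure 2. The 55/25/20 energy split is an
# assumption; the 53% SRAM and 25% MAC savings are from the article.
sram, transfer, mac = 0.55, 0.25, 0.20      # assumed baseline split

# Step 1: circuit changes -- latched SRAM (-53%) and new MAC (-25%).
circuit = sram * (1 - 0.53) + transfer + mac * (1 - 0.25)

# Step 2: dropping 0.80V to 0.55V scales all switching energy by V^2.
total = circuit * (0.55 / 0.80) ** 2

print(f"after circuit changes: {1 - circuit:.0%} saved")  # ~34%
print(f"after voltage drop:    {1 - total:.0%} saved")    # ~69%, vs 68% stated
```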

Figure 2. Energy reduction in 12LP+. In a typical systolic MAC array, the new SRAM and MAC designs decrease total power by a third, and cutting the operating voltage takes off another third relative to the previous 12LP technology. (Data source: GlobalFoundries)

As usual, GlobalFoundries supports its 12LP+ process with a broad library of physical components, including digital, analog, and passive devices. It provides EDA tools such as Cadence and Synopsys plug-ins, Spice models, design-rule checkers, timing models, and place-and-route capability. To improve yield, it offers a complete design-for-manufacturability (DFM) flow. The foundry has recharacterized its 12LP physical IP, including memory and I/O interfaces, for 12LP+. In addition to Arm’s low-voltage standard-cell library, third-party IP vendors such as Rambus and Synopsys also support 12LP+.

Powering the AI Leaders

The new technology builds on the proven success of GF’s 12LP process, which powers industry-leading AI products. For example, Silicon Valley startup Groq has developed a new architectural approach to accelerating neural networks that combines hundreds of function units in a single core. The massive design includes 220MB of SRAM and more than 200,000 MAC units (see MPR 1/6/20, “Groq Rocks Neural Networks”). Groq adopted 12LP to keep such a large design within a 300W power budget. At an initial speed of 1.0GHz, the chip achieves a peak throughput of 820 trillion operations per second (TOPS) for INT8 data, surpassing all other announced accelerators.
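The peak-throughput figure is consistent with simple arithmetic, if one assumes each MAC unit splits into two INT8 MACs (as in the design described above) and counts each MAC as two operations; the exact unit count below is our inference from the quoted numbers, not a Groq disclosure.

```python
# Sanity check of Groq's 820 TOPS claim. The MAC count is inferred
# ("more than 200,000" per the article), not a Groq disclosure.
mac_units = 204_800          # inferred; consistent with ">200,000"
ops_per_cycle = mac_units * 2 * 2   # x2 INT8 split, x2 ops (mul+add)
clock_hz = 1.0e9             # initial 1.0GHz speed
print(f"{ops_per_cycle * clock_hz / 1e12:.0f} TOPS")   # 819 ~ 820
```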

Tenstorrent, a Canadian startup, also accelerates inferencing but chose a different design target: the 75W power limit for a bus-powered PCIe card. Its first chip features 120 independent cores that each include 1MB of SRAM and about 500 MAC units. This approach still requires lots of SRAM and MAC units. At a preliminary speed of 1.3GHz, the chip can deliver 368 TOPS (see MPR 4/13/20, “Tenstorrent Scales AI Performance”). The 12LP technology helps Tenstorrent achieve 4.9 TOPS per watt, the best efficiency rating among data-center products, as Figure 3 shows.

Figure 3. Comparison of high-end AI accelerators. Groq’s TSP provides greater performance (measured in trillions of operations per second, or TOPS) than Nvidia’s new A100 while using less power. Tenstorrent targets lower performance but delivers three times the power efficiency (TOPS per watt) of the A100. (Data source: vendors)

Nvidia, which has the greatest share in this market, recently delivered its A100 accelerator based on the new Ampere architecture. Ampere introduces many innovative features and boosts peak performance to 624 TOPS, beating every announced chip except Groq’s. But despite a shrink to 7nm technology, the A100 requires a 400W TDP, 33% higher than the previous 12nm product. To fit even this increased power budget, Nvidia had to reduce the clock speed relative to the 12nm product and disable 15% of the cores on the die, an unusual tactic that could indicate the chip’s power was considerably higher than simulated (see MPR 6/8/20, “Nvidia A100 Tops in AI Performance”). As a result, the A100 badly lags the Groq and Tenstorrent chips in performance per watt, despite its smaller transistors.
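The efficiency ordering in Figure 3 follows directly from the vendor numbers quoted in this section:

```python
# Peak INT8 TOPS and power figures as quoted in the text (vendor data).
chips = {"Groq TSP (GF 12LP)":     (820, 300),
         "Tenstorrent (GF 12LP)":  (368,  75),
         "Nvidia A100 (TSMC 7nm)": (624, 400)}
for name, (tops, watts) in chips.items():
    print(f"{name}: {tops / watts:.1f} TOPS/W")
# Groq ~2.7, Tenstorrent ~4.9, A100 ~1.6: Tenstorrent delivers about
# three times the A100's efficiency despite the older process node.
```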

One advantage of the 7nm TSMC process is that it doubles transistor density relative to the 12nm GF process, enabling Nvidia to pack more than 50 billion transistors into the A100. To help its customers compete in this regard, GF supports various chiplet approaches. The company’s extensive experience in multidie packaging includes 2.5D silicon interposer designs that feature high-bandwidth memory (HBM). For 3D chip stacking, it has developed hybrid wafer bonding (HWB) technology that uses through-silicon vias (TSVs) with a 5.76-micron pitch and a roadmap to greater density. For low-density interconnects, customers can build a chiplet configuration on inexpensive organic substrates, similar to AMD’s Rome processor. Any of these chiplet approaches can enable a high transistor count without moving to 7nm.

Better Than 7nm

For its 7nm technology, TSMC claims a clock-speed gain of up to 20% and a power reduction of up to 40% relative to its 10nm node (see MPR 5/20/19, “EUV Processes Reach Mass Production”). But these best-case numbers assume lightly loaded transistors; complex processor designs, which are typically limited by metal capacitance instead of transistor speed, achieve half of these gains or less. Nvidia’s 7nm A100, as noted, is slower than its 12nm predecessor, and Qualcomm’s first 7nm processor, the Snapdragon 855, increases the maximum CPU speed by just 2% over the Snapdragon 845. TSMC expects 5nm’s benefits will be smaller than 7nm’s even as greater EUV use raises per-wafer and tapeout costs.

GlobalFoundries’ 12LP+ provides an alternate path, offering much larger power reductions than TSMC’s 7nm without the added cost. Much of the reduction comes from the new dual-work-function transistors, which enable a 0.55V option. For 7nm, TSMC provides ultra-low-VT (ULVT) transistors that can operate on as little as 0.6V. That foundry has long focused on low-voltage operation for its smartphone customers, whereas GF has been more PC focused until recently, so this gain is largely closing a gap.

The remainder of the 12LP+ advantage comes from the technology’s AI-specific SRAM and MAC units. This approach reflects the foundry’s differentiation: whereas TSMC must serve a broad range of customers, GF can focus on certain emerging workloads. The AI segment is particularly fruitful because so many companies, particularly startups, are developing CNN accelerators. Large customers typically design their own caches and MAC units, but GF’s designs are useful for startups that want to minimize development cost while focusing on their unique architectures.

The longer-term question is whether GF can remain competitive without a roadmap to 7nm and beyond. TSMC’s 5nm technology is in volume production, and customers have already started designs in future nodes. These advanced processes enable designers to pack more memory and MAC units into their chips. Large companies with the biggest market share will continue down this path. Smaller companies targeting the AI market will find 12LP+ much more affordable, and they can use chiplets to cost-effectively increase transistor count. Groq and Tenstorrent have achieved leading AI performance with GlobalFoundries’ 12LP technology, and the AI enhancements in 12LP+ will make the new technology even better.

Price and Availability

GlobalFoundries’ 12LP+ technology is available for design starts; we expect volume production to begin in 2H21. For more information, access www.globalfoundries.com/news-events/press-releases/optimized-ai-accelerator-applications-globalfoundries-12lp-finfet
