
Linley Newsletter

Groq Rocks Neural Networks

January 7, 2020

Author: Linley Gwennap

Groq has taken an entirely new architectural approach to accelerating neural networks. Instead of creating a small programmable core and replicating it dozens or hundreds of times, the startup designed a single enormous processor that has hundreds of function units. This approach greatly reduces instruction-decoding overhead, enabling the initial Tensor Streaming Processor (TSP) chip to pack 220MB of SRAM while computing more than 400,000 integer multiply-accumulate (MAC) operations per cycle. The result is performance of up to 1,000 trillion operations per second (TOPS), four times faster than Nvidia’s best GPU. Initial ResNet-50 results show a similar advantage. The chip can also handle floating-point data, allowing it to perform both inference and training. The startup is now sampling the TSP, and we expect production shipments to begin around midyear.
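The headline numbers are internally consistent, and a quick back-of-the-envelope check makes that concrete. This sketch assumes the usual convention that one MAC counts as two operations (a multiply plus an accumulate) and that the peak-TOPS figure is MAC-limited; the implied clock frequency is our inference, not a figure from the article.

```python
# Sanity-check the TSP's headline figures from the article.
# Assumption: 1 MAC = 2 ops (multiply + accumulate), peak TOPS is MAC-limited.
macs_per_cycle = 400_000      # "more than 400,000 integer MACs per cycle"
ops_per_mac = 2               # one multiply plus one accumulate
peak_ops_per_sec = 1_000e12   # 1,000 TOPS peak

implied_clock_hz = peak_ops_per_sec / (macs_per_cycle * ops_per_mac)
print(f"Implied clock: {implied_clock_hz / 1e9:.2f} GHz")  # ~1.25 GHz
```

The arithmetic suggests a clock around 1.25GHz, a plausible frequency for a large reticle-scale die.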

Even general-purpose processors have long since abandoned large monolithic CPUs in favor of multicore designs. One challenge with creating a physically large CPU is that clock skew makes it difficult to synchronize operations. Groq instead allows instructions to ripple across the processor, executing at different times in different units. This technique simplifies the design and routing, eliminates the need for synchronization, and is easily scalable.
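The "rippling" idea can be illustrated with a toy schedule: each column of function units executes an instruction one cycle after its neighbor, so a chip-wide synchronization point is never needed. This is a minimal sketch of the concept, not Groq's actual scheduling scheme; the one-cycle-per-column delay is an assumption for illustration.

```python
# Toy model of a "rippling" instruction wavefront (illustrative, not Groq's
# real design): an instruction issued at cycle t executes in column c at
# cycle t + c, so adjacent columns are offset by one cycle.
def ripple_schedule(instructions, num_columns):
    """Return {(cycle, column): instruction} for a staggered wavefront."""
    schedule = {}
    for issue_cycle, instr in enumerate(instructions):
        for col in range(num_columns):
            # the instruction reaches column `col` exactly `col` cycles later
            schedule[(issue_cycle + col, col)] = instr
    return schedule

sched = ripple_schedule(["MAC", "ADD"], num_columns=3)
# "MAC" issued at cycle 0 reaches column 2 at cycle 2:
assert sched[(2, 2)] == "MAC"
```

Because every column's timing is a fixed offset from its neighbor's, the compiler knows exactly when each unit executes each instruction, with no hardware interlocks.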

Within each lane, data flows horizontally; the TSP pushes it across the chip on every clock cycle. Again, this simplifies the routing and allows a natural flow of data during neural-network calculations. Memory is embedded with the function units, providing a high-bandwidth data source and eliminating the need for external memory. The result is somewhat like a systolic array of heterogeneous function units, but the data moves only horizontally while the instructions move vertically.
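The horizontal data flow behaves like a shift path: each cycle, every value advances one stage, and a function unit at a fixed position operates on whatever passes by. The sketch below is a hypothetical model of that behavior, with the stage count and the operation chosen arbitrarily for illustration.

```python
# Hypothetical model of horizontal data streaming: values shift one stage
# per cycle through a fixed-depth path, and an operation is applied as each
# value exits. Depth and op are illustrative, not actual TSP parameters.
def stream_through(data, depth, op):
    """Clock `data` through a `depth`-stage shift path, applying `op` on exit."""
    stages = [None] * depth   # empty pipeline (bubbles)
    results = []
    for value in list(data) + [None] * depth:  # extra cycles drain the path
        exiting = stages[-1]                   # value leaving this cycle
        stages = [value] + stages[:-1]         # everything shifts one stage
        if exiting is not None:
            results.append(op(exiting))
    return results

assert stream_through([1, 2, 3], depth=4, op=lambda x: x * x) == [1, 4, 9]
```

Each input emerges exactly `depth` cycles after it enters, which is what makes the data movement so predictable: latency is a constant, not a function of contention.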

Because the architecture lacks caches, branch prediction, and similar mechanisms, it’s fully deterministic. The software tools accept models developed in TensorFlow, and the company is developing drivers for other popular frameworks.
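Full determinism means a program's execution time is a pure function of the program itself, so the compiler can predict latency exactly at build time. The per-operation cycle costs below are hypothetical placeholders; the point is only that, with no caches or speculation, the count is exact rather than an estimate.

```python
# On a fully deterministic machine, predicted cycles == actual cycles.
# The per-op costs here are hypothetical, for illustration only.
OP_CYCLES = {"load": 1, "mac": 1, "store": 1}

def predicted_cycles(program):
    """Exact cycle count for a straight-line program on a deterministic core."""
    return sum(OP_CYCLES[op] for op in program)

assert predicted_cycles(["load", "mac", "mac", "store"]) == 4
```

This property is particularly valuable for inference serving, where worst-case latency guarantees matter as much as throughput.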

Subscribers can view the full article in the Microprocessor Report.

