A Guide to Processors for Deep Learning
Third Edition
Published February 2020
Authors: Linley Gwennap and Mike Demler
Corporate License: $5,995
Take a Deep Dive into Deep Learning
Deep learning, the technology behind most of today's advances in artificial intelligence (AI), has improved rapidly over the past few years and is now being deployed in a wide variety of applications. Typically implemented using neural networks, deep learning powers image recognition, voice processing, language translation, and many other web services in large data centers. It is an essential technology in self-driving cars, providing both object recognition and decision making. It is even moving into client devices such as smartphones and embedded (IoT) systems.
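For readers new to the field, the computation at the heart of these networks is simple: each artificial neuron forms a weighted sum of its inputs and applies a nonlinear activation function, and a network stacks many layers of such neurons. The Python sketch below is purely illustrative (the layer size, weights, and ReLU activation are arbitrary choices); the report's technology chapter covers the details.

```python
import numpy as np

def dense_layer(x, weights, bias):
    # Each output neuron computes a weighted sum of its inputs plus a bias,
    # then applies an activation function (here, ReLU).
    return np.maximum(0, weights @ x + bias)

# Toy dimensions and random weights, for illustration only.
rng = np.random.default_rng(seed=1)
x = rng.standard_normal(4)         # four input values
w = rng.standard_normal((3, 4))    # one weight row per output neuron
b = np.zeros(3)                    # per-neuron bias
print(dense_layer(x, w, b))        # three output activations
```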
Even the fastest CPUs cannot efficiently execute the highly complex neural networks these advanced problems require. Boosting performance demands more-specialized hardware architectures. Graphics chips (GPUs) have become popular, particularly for the initial training function. Many other hardware approaches have recently emerged, including DSPs, FPGAs, and dedicated ASICs. Although these newer solutions promise order-of-magnitude improvements, GPU vendors are tuning their designs to better support deep learning.
Autonomous vehicles are an important application for deep learning. Vehicles don't implement training but instead focus on the simpler inference tasks. Even so, they need very powerful processors, and they face tighter cost and power constraints than data-center servers, forcing different tradeoffs. Several chip vendors are delivering products specifically for this application; some automakers are developing their own ASICs instead.
Large chip vendors such as Intel and Nvidia currently generate the most revenue from deep-learning processors. But many startups, some well funded, have emerged to develop new, more-customized architectures for deep learning; Cerebras, Graphcore, GreenWaves, Gyrfalcon, Groq, Habana (now part of Intel), and Horizon Robotics are among the first to deliver products. Eschewing these options, leading data-center operators such as Alibaba, Amazon, Google, and Microsoft have developed their own hardware accelerators.
We Sort Out the Market and the Products
A Guide to Processors for Deep Learning covers hardware technologies and products. The report provides deep technology analysis and head-to-head product comparisons, as well as analysis of company prospects in this rapidly developing market segment. It explains which products will win designs, and why. The Linley Group’s unique technology analysis provides a forward-looking view, helping sort through competing claims and products.
The guide begins with a detailed overview of the market. We explain the basics of deep learning, the types of hardware acceleration, and the end markets, including a forecast for both automotive and data-center adoption. The heart of the report provides detailed technical coverage of announced chip products from AMD, Cerebras, Graphcore, Groq, Gyrfalcon, Horizon Robotics, Intel (including former Altera, Habana, Mobileye, and Movidius), Mythic, Nvidia (including Tegra and Tesla), Wave Computing, and Xilinx. Other chapters cover Google’s TPU family of ASICs and Tesla’s autonomous-driving ASIC. We also include shorter profiles of numerous other companies developing AI chips of all sorts, including Alibaba, Blaize, BrainChip, Huawei, Lattice, and Syntiant. Finally, we bring it all together with technical comparisons in each product category and our analysis and conclusions about this emerging market.
Make Informed Decisions
As the leading vendor of technology analysis for processors, The Linley Group has the expertise to deliver a comprehensive look at the full range of chips designed for a broad range of deep-learning applications. Principal analyst Linley Gwennap and senior analyst Mike Demler use their experience to deliver the deep technical analysis and strategic information you need to make informed business decisions.
Whether you are looking for the right processor for an automotive application, an IoT device, or a data-center accelerator, or seeking to partner with or invest in one of these vendors, this report will cut your research time and save you money. Make the smart decision: order A Guide to Processors for Deep Learning today.
This report is written for:
- Engineers designing chips or systems for deep learning or autonomous vehicles
- Marketing and engineering staff at companies that sell related chips who need more information on processors for deep learning or autonomous vehicles
- Technology professionals who want an introduction to deep learning, vision processing, or autonomous-driving systems
- Financial analysts who desire a hype-free analysis of deep-learning processors and of which chip suppliers are most likely to succeed
- Press and public-relations professionals who need to get up to speed on this emerging technology
This market is developing rapidly — don't be left behind!
The third edition of A Guide to Processors for Deep Learning covers dozens of new products and technologies announced in the past year, including:
- The TSP from startup Groq, the first chip to reach 1,000 TOPS (see the peak-TOPS arithmetic sketched after this list)
- Huawei’s Ascend family of accelerators, which ranges from edge to the data center
- Google’s first chip product, the tiny Edge TPU
- The massive Wafer-Scale Engine (WSE) from Cerebras
- Syntiant’s NDP10x chip, which performs voice recognition using less than one milliwatt
- The HanGuang ASIC from Alibaba, which achieved a record ResNet-50 score
- GreenWaves’ second-generation IoT processor, the GAP9
- Tesla’s ASIC, which ships in all the company’s new cars and sets the automotive TOPS record
- Habana’s Gaudi accelerator for AI training
- Horizon Robotics’ vision accelerators for automotive designs
- Lattice’s sensAI technology, which turns its low-power FPGAs into AI accelerators
- Intel’s Cascade Lake server processors with AI-inference extensions
- The Kunlun ASIC from Chinese cloud giant Baidu
- GrAI Matter Labs’ GrAI One, a low-power neuromorphic processor
- Startup Esperanto’s Maxion architecture
- Habana’s acquisition by Intel for $2 billion
- A first look at Qualcomm’s Cloud AI program
- Other new AI vendors such as Achronix, Cornami, Flex Logix, Hailo, Knowles, and Synaptics
- New product roadmaps and other updates on all vendors
- Updated market size and forecast, reflecting the 2019 slowdown in cloud spending and slower progress in autonomous driving
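A note on TOPS ratings such as Groq's: a chip's peak rating generally follows from simple arithmetic, the number of multiply-accumulate (MAC) units times the clock frequency, with each MAC counted as two operations (a multiply and an add). The sketch below uses hypothetical figures chosen to yield 1,000 TOPS; it says nothing about how Groq's chip, or any other, is actually organized.

```python
# Hypothetical figures, for illustration only: how a 1,000-TOPS peak
# rating can arise. Real chips vary in MAC count, precision, and clock.
mac_units = 250_000   # assumed number of multiply-accumulate (MAC) units
clock_ghz = 2.0       # assumed clock frequency in GHz
ops_per_mac = 2       # each MAC counts as two operations: multiply + add

# 1 TOPS = 1e12 operations per second; GHz supplies a factor of 1e9.
peak_tops = mac_units * clock_ghz * 1e9 * ops_per_mac / 1e12
print(f"{peak_tops:,.0f} TOPS")  # prints: 1,000 TOPS
```

Sustained throughput on real networks is usually far below this peak, which is why the comparison chapters also report measured ResNet-50 results.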
Deep-learning technology is being deployed or evaluated in nearly every industry in the world. This report focuses on the hardware that supports this AI revolution. As demand for the technology grows rapidly, we see opportunities for deep-learning accelerators (DLAs) in three general areas: the data center, automobiles, and embedded (edge) devices.
Large cloud-service providers (CSPs) can apply deep learning to improve web search, language translation, email filtering, product recommendations, and voice assistants such as Alexa, Cortana, and Siri. Data-center DLAs generated more than $3 billion in revenue in 2019 and will approach $10 billion within five years. By 2024, we expect nearly half of all new servers (and most cloud servers) to include a DLA.
Deep learning is critical to the development of self-driving cars. Level 2 ADAS functions such as autonomous emergency braking already appear in more than half of new cars in the US and Europe; we forecast 70% worldwide adoption by 2024. Level 3–4 autonomy is taking longer than anticipated to reach the market, but we expect robotaxis and other commercial vehicles to deploy in the next few years, generating more than $2 billion in processor revenue in 2024.
To improve latency and reliability for voice and other cloud services, edge products such as smartphones, drones, smart speakers, security cameras, and Internet of Things (IoT) devices are implementing neural networks. We expect 1.6 billion edge devices to ship with DLAs in 2024.
This rapid market growth has spurred many new companies to develop chips with DLAs. This report covers more than 40 vendors, including those developing chips only for internal use; we’re aware of many others that have disclosed too little information to cover. Although this situation is reminiscent of previous booms in graphics accelerators and network processors, we expect a greater number of winners in this competition. Given the widely differing product requirements in the data center, automotive, and various embedded market segments, different companies are likely to take the lead in each.
Nvidia dominates the data-center market with its Volta and Turing GPUs, which include “tensor cores” that greatly improve their performance for neural-network training and inference. Customers prefer the company’s broad and reliable software stack. But it didn’t release a new GPU in 2019, leaving it more vulnerable to competition. Nvidia also leads the push to develop autonomous vehicles; its Xavier processor is the industry’s first single-chip solution for Level 3 autonomy, and its Drive AGX cards deliver even greater capabilities.
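For context, the tensor cores mentioned above accelerate small matrix multiply-accumulate operations of the form D = A×B + C, multiplying reduced-precision (FP16) operands while accumulating in FP32 to preserve accuracy. The NumPy sketch below models only this numeric behavior; the 4×4 tile size is illustrative.

```python
import numpy as np

# Model of a mixed-precision matrix multiply-accumulate, D = A*B + C:
# reduced-precision (FP16) inputs with full-precision (FP32) accumulation.
a = np.random.rand(4, 4).astype(np.float16)   # low-precision operand
b = np.random.rand(4, 4).astype(np.float16)   # low-precision operand
c = np.zeros((4, 4), dtype=np.float32)        # high-precision accumulator

d = a.astype(np.float32) @ b.astype(np.float32) + c
print(d.dtype)  # float32: products accumulate at the higher precision
```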
Intel offers several DLA architectures. Its standard Xeon CPUs are often used for inference, and the new Cascade Lake models triple inference throughput. After spending $2 billion to acquire Habana’s high-end DLA chips for training and inference, the company terminated its similar Nervana products. Intel also sells a range of FPGAs for customers that wish to design their own DLA architecture. Its low-power Myriad chips target drones and other camera-based devices. In addition, its Mobileye subsidiary is a leader in ADAS and is moving up to Level 3 and above.
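Cascade Lake's inference gain comes largely from the AVX-512 VNNI instructions (marketed as DL Boost), which fuse 8-bit multiplies with 32-bit accumulation in a single instruction. The Python sketch below emulates the arithmetic of one 32-bit accumulator lane; it models the numeric behavior only, not actual vector code.

```python
import numpy as np

def vnni_lane(acc, a_u8, b_s8):
    # Emulate one 32-bit lane of a VNNI-style fused operation: multiply
    # four unsigned 8-bit values by four signed 8-bit values, then add
    # all four products into the 32-bit accumulator in one step.
    return acc + (a_u8.astype(np.int32) * b_s8.astype(np.int32)).sum()

acc = np.int32(0)
a = np.array([10, 20, 30, 40], dtype=np.uint8)  # quantized activations
b = np.array([1, -2, 3, -4], dtype=np.int8)     # quantized weights
print(vnni_lane(acc, a, b))  # 10 - 40 + 90 - 160 = -100
```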
Several others offer data-center DLAs. Cerebras, Graphcore, Groq, and other startups are sampling or shipping chips that use new architectures to outperform Nvidia’s GPUs on at least some workloads. AMD supplies Radeon Instinct GPUs, but they lack AI-specific features and fall well behind for neural networks. Xilinx added AI cores to its new Versal FPGAs but hasn’t disclosed any neural-network benchmarks for the pre-production chips. These vendors must also compete against in-house inference ASICs at Alibaba, Amazon, Baidu, Google, Microsoft, and other top CSPs, all of which are now deploying these devices. Huawei disclosed its own AI architecture that it sells in servers and as a cloud service. Each of these companies offers a limited software stack and thus can target only a few workloads.
Established automotive suppliers such as NXP, Renesas, and Toshiba compete against Mobileye in the ADAS market. Well-funded Chinese startups such as Black Sesame and Horizon Robotics have released ADAS processors and are developing more-powerful chips for autonomous driving. Other startups also target this market, including Blaize (formerly ThinCI), Hailo, and Kalray. The slowdown in autonomous deployment, as well as the need to meet stringent safety standards, means these companies will take years to generate significant automotive revenue. Some automakers are developing their own chips for autonomous vehicles, but only Tesla has disclosed any details on its ASIC, which is already in production.
The embedded market has attracted the most startups, as the cost of both hardware and software development is relatively low, and design wins can quickly generate revenue. This market encompasses multiple end applications. Gyrfalcon and NovuMind offer high-performance DLA chips for consumer video applications. Smart speakers and other voice-activated devices require low power but only modest performance; Syntiant supplies the lowest-power chip for keyword spotting, but it competes against Ambient, Knowles, and Synaptics. For ultra-low-power sensors, BrainChip and GrAI Matter use neuromorphic technology to reduce power, while Eta Compute, GreenWaves, and Lattice provide alternatives.
Comparing the capabilities of such products is complicated; much depends on the needs of the end application. This report provides the data necessary to evaluate these companies and their products, along with our analysis of how well they address market requirements.
List of Figures
List of Tables
About the Authors
About the Publisher
Preface
Executive Summary
1 Deep-Learning Applications
What Is Deep Learning?
Cloud-Based Deep Learning
Advanced Driver-Assistance Systems
Autonomous Vehicles
Voice Assistants
Smart Cameras
Manufacturing
Robotics
Financial Technology
Health Care and Medicine
2 Deep-Learning Technology
Artificial Neurons
Deep Neural Networks
Spiking Neural Networks
Neural-Network Training
Training Spiking Neural Networks
Pruning and Compression
Neural-Network Inference
Quantization
Neural-Network Development
Neural-Network Models
Natural-Language Models
3 Deep-Learning Accelerators
Accelerator Design
Data Formats
Computation Units
Dot Products
Systolic Arrays
Handling Sparsity
Other Common Functions
Processor Architectures
CPUs
GPUs
DSPs
Custom Architectures
FPGAs
Performance Measurement
Peak Operations
Neural-Network Performance
MLPerf Benchmarks
AI-Benchmark
4 Market Forecast
Market Overview
Data Center and HPC
Market Size
Market Forecast
Automotive
Market Size
Market Forecast
Autonomous Forecast
Client and IoT
Market Size
Market Forecast
5 AMD
Company Background
Key Features and Performance
Conclusions
6 Cerebras
Company Background
Key Features and Performance
Conclusions
7 Google
Company Background
Key Features and Performance
Data-Center TPUs
Edge TPU
Conclusions
8 Graphcore
Company Background
Key Features and Performance
Product Roadmap
Conclusions
9 Groq
Company Background
Key Features and Performance
Conclusions
10 Gyrfalcon
Company Background
Key Features and Performance
Product Roadmap
Conclusions
11 Habana (Intel)
Company Background
Key Features and Performance
Conclusions
12 Horizon Robotics
Company Background
Key Features and Performance
Product Roadmap
Conclusions
13 Intel
Company Background
Xeon Processors
Key Features and Performance
Product Roadmap
Nervana Accelerators
Key Features and Performance
Stratix and Agilex FPGAs
Key Features and Performance
Product Roadmap
Movidius Myriad
Key Features and Performance
Product Roadmap
Conclusions
14 Mobileye (Intel)
Company Background
Key Features and Performance
Product Roadmap
Conclusions
15 Mythic
Company Background
Key Features and Performance
Product Roadmap
Conclusions
16 Nvidia Tegra
Company Background
Key Features and Performance
Software Development
Product Roadmap
Conclusions
17 Nvidia Tesla
Company Background
Key Features and Performance
Product Roadmap
Conclusions
18 Tesla (Motors)
Company Background
Key Features and Performance
Product Roadmap
Conclusions
19 Wave Computing
Company Background
Key Features and Performance
Product Roadmap
Conclusions
20 Xilinx
Company Background
Key Features and Performance
UltraScale+
Alveo
Versal
Conclusions
21 Other Automotive Vendors
Black Sesame
Blaize
Fabu
Hailo
Company Background
Key Features and Performance
Conclusions
Kalray
Company Background
Key Features and Performance
Conclusions
NXP
Company Background
Key Features and Performance
Conclusions
Renesas
Company Background
Key Features and Performance
Conclusions
Toshiba
Company Background
Key Features and Performance
Conclusions
22 Other Data-Center Vendors
Achronix
Company Background
Key Features and Performance
Conclusions
Alibaba
Key Features and Performance
Conclusions
Amazon
Baidu
Centaur (Via)
Company Background
Key Features and Performance
Conclusions
Esperanto
Furiosa
Huawei
Key Features and Performance
Conclusions
Marvell
Microsoft
Company Background
Key Features and Performance
Conclusions
Qualcomm
Company Background
Key Features and Performance
Conclusions
SambaNova
23 Other Embedded Vendors
Ambient
BrainChip
Cornami
Eta Compute
Company Background
Key Features and Performance
Conclusions
Flex Logix
Company Background
Key Features and Performance
Conclusions
GrAI Matter
Company Background
Key Features and Performance
Conclusions
GreenWaves
Company Background
Key Features and Performance
Conclusions
Knowles
Company Background
Key Features and Performance
Conclusions
Lattice
Company Background
Key Features and Performance
Conclusions
NovuMind
Company Background
Key Features and Performance
Conclusions
Synaptics
Company Background
Key Features and Performance
Conclusions
Syntiant
Company Background
Key Features and Performance
Conclusions
24 Processor Comparisons
How to Read the Tables
Data-Center Training
Architecture
Interfaces
Performance
Summary
Data-Center Inference
Architecture
Interfaces
Performance
Summary
Automotive Processors
CPU Subsystem
Vision Processing
Interfaces
Summary
Embedded Processors
Performance
Memory and Interfaces
Summary
Embedded Coprocessors
Performance
Memory and Interfaces
Summary
Ultra-Low-Power Processors
Performance and Power
Memory and Interfaces
Summary
25 Conclusions
Market Summary
Data Center
Automotive
Embedded
Technology Trends
Neural Networks
Hardware Options
Performance Metrics
Vendor Summary
Data Center
Automotive
Embedded
Closing Thoughts
Appendix: Further Reading
Index
Figure 1‑1. SAE autonomous-driving levels
Figure 1‑2. Waymo autonomous test vehicle
Figure 1‑3. GM’s autonomous-vehicle prototype
Figure 1‑4. Various smart speakers
Figure 1‑5. A smart surveillance camera
Figure 1‑6. Processing steps in a computer-vision neural network
Figure 1‑7. Robotic arms use deep learning
Figure 1‑8. Example biopsy images used to diagnose breast cancer
Figure 2‑1. Neuron connections in a biological brain
Figure 2‑2. Model of a neural-network processing node
Figure 2‑3. Common activation functions
Figure 2‑4. Model of a four-layer neural network
Figure 2‑5. Spiking effect in biological neurons
Figure 2‑6. Spiking-neural-network pattern
Figure 2‑7. Pruning a neural network
Figure 2‑8. Mapping from floating-point format to integer format
Figure 3‑1. Common AI data types and approximate data ranges
Figure 3‑2. Arm dot-product operation
Figure 3‑3. A systolic array
Figure 3‑4. Performance versus batch size
Figure 4‑1. Revenue forecast for deep-learning chips, 2016–2024
Figure 4‑2. Unit forecast for deep-learning chips, 2016–2024
Figure 4‑3. Unit forecast for ADAS-equipped vehicles, 2016–2024
Figure 4‑4. Revenue forecast for ADAS processors, 2016–2024
Figure 4‑5. Unit forecast for client deep-learning chips, 2016–2024
Figure 6‑1. Cerebras wafer-scale engine (WSE)
Figure 7‑1. Google TPUv2 board
Figure 8‑1. Graphcore C2 card
Figure 9‑1. TSP conceptual diagram
Figure 11‑1. Functional diagram of Habana HLS-1 system
Figure 13‑1. Intel NNP-I chip in an M.2 card
Figure 15‑1. Mythic’s flash-based neural-network tile
Figure 21‑1. Hailo-8 heterogeneous-resource map
Figure 21‑2. Block diagram of NXP S32V234
Figure 21‑3. Block diagram of Renesas R-Car V3H
Figure 21‑4. Block diagram of Toshiba TMPV770 ADAS processor
Figure 22‑1. Block diagram of DSP core in Achronix Speedster7t
Figure 22‑2. Block diagram of Alibaba HanGuang 800
Figure 22‑3. Block diagram of Centaur CHA processor
Figure 23‑1. Block diagram of Eta Compute ECM3531
Figure 23‑2. Block diagram of Flex Logix InferX X1
Figure 23‑3. Block diagram of GreenWaves GAP8 processor
Figure 23‑4. Block diagram of Knowles AISonic processor
Figure 23‑5. Block diagram of Lattice sensAI architecture for the ECP5
Figure 23‑6. Block diagram of Synaptics AudioSmart AS-371
Figure 23‑7. Block diagram of Syntiant NDP101 speech processor
Figure 24‑1. ResNet-50 v1.0 training throughput
Figure 24‑2. ResNet-50 v1.0 inference throughput
Figure 24‑3. ResNet-50 v1.0 inferences per watt
Figure 24‑4. ResNet-50 v1.0 inference latency
Figure 25‑1. Model size trend, 2012–2019
Figure 25‑2. Deep-learning accelerators
Table 2‑1. Size and compute requirement of popular DNNs
Table 4‑1. Data-center DLA units and revenue, 2018–2024
Table 5‑1. Key parameters for AMD Radeon Instinct accelerators
Table 7‑1. Key parameters for Google TPU accelerators
Table 7‑2. Key parameters for Google Edge TPU
Table 8‑1. Key parameters for Graphcore GC2 processor
Table 9‑1. Key parameters for Groq TSP architecture
Table 10‑1. Key parameters for Gyrfalcon Lightspeeur coprocessors
Table 11‑1. Key parameters for Habana Goya accelerator card
Table 12‑1. Key parameters for Horizon Robotics processors
Table 13‑1. Key parameters for selected Intel Cascade Lake processors
Table 13‑2. Key parameters for Intel Nervana processors
Table 13‑3. Key parameters for selected Intel Stratix 10 GX FPGAs
Table 13‑4. Key parameters for Intel Movidius processors
Table 14‑1. Key parameters for Mobileye EyeQ processors
Table 16‑1. Key parameters for Nvidia automotive processors
Table 17‑1. Key parameters for Nvidia deep-learning GPUs
Table 18‑1. Key parameters for Tesla FSD ASIC
Table 19‑1. Key parameters for Wave DPU accelerators
Table 20‑1. Key parameters for selected Xilinx FPGAs
Table 21‑1. AI-chip companies targeting automotive applications
Table 21‑2. Key parameters for Kalray MPPA3 processor
Table 22‑1. AI-chip companies targeting data-center applications
Table 22‑2. Key parameters for Huawei Ascend 910 accelerator card
Table 22‑3. Key parameters for Microsoft Brainwave accelerator
Table 23‑1. AI-chip companies targeting embedded applications
Table 24‑1. Comparison of CPUs and GPUs for AI training
Table 24‑2. Comparison of accelerators for AI training
Table 24‑3. Comparison of high-end DLAs for AI inference
Table 24‑4. Comparison of midrange DLAs for AI inference
Table 24‑5. Comparison of ADAS processors
Table 24‑6. Comparison of autonomous-vehicle processors
Table 24‑7. Comparison of embedded DLA SoCs
Table 24‑8. Comparison of embedded DLA coprocessors
Table 24‑9. Comparison of low-power processors for deep learning