Linley Newsletter
Arm Dot Products Accelerate CNNs
July 10, 2018
Author: Mike Demler
Arm's new dot-product instructions deliver up to a 4x performance boost to convolutional-neural-network (CNN) operations running on 64-bit Cortex-A CPUs. These networks are the most popular deep-learning architecture for image recognition and other machine-learning applications.
Analyzing an image can require billions of dot products, using multiply-accumulators (MACs) to process the pixel data through feature-extraction filters. For example, the popular ResNet-50 CNN requires 3.9 billion MAC operations per image. Designers can run CNNs on Cortex-A and Cortex-M CPUs, as well as Mali GPUs, using the neural-network libraries and APIs in Arm's Project Trillium software stack. Although large network models such as ResNet-50 are better suited to dedicated deep-learning accelerators (DLAs), the new dot-product instructions make a general-purpose CPU with SIMD capabilities sufficient for small networks.
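To give a sense of where such MAC counts come from, the following minimal sketch counts the multiply-accumulates in a single convolution layer. Each output value is one dot product between a filter and an image patch. The function name and the layer shape are illustrative assumptions, not figures from the article: the shape is the commonly published first layer of ResNet-50 (7x7 filters, 3 input channels, 64 output channels, 112x112 output map).

```python
def conv_macs(out_h, out_w, out_channels, kernel_h, kernel_w, in_channels):
    """Count MACs for one convolution layer (hypothetical helper).

    Every output element is one dot product whose length equals the
    number of weights in a single filter.
    """
    dot_length = kernel_h * kernel_w * in_channels  # MACs per dot product
    num_dots = out_h * out_w * out_channels         # one dot product per output
    return num_dots * dot_length


# Assumed layer shape: first convolution of ResNet-50 as commonly published.
first_layer_macs = conv_macs(112, 112, 64, 7, 7, 3)
print(first_layer_macs)  # 118013952
```

Under these assumptions a single layer already needs about 118 million MACs, which suggests how a full network reaches the billions.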
Running some CNN operations on the CPU also allows it to work in parallel with a GPU or DLA, distributing the workload in a heterogeneous SoC. Each of the new Neon instructions calculates four dot products at once, each over four pairs of 8-bit elements, and accumulates the results in a 128-bit vector comprising the four 32-bit totals. These instructions were officially introduced as part of Arm v8.4, first implemented in the Cortex-A76 CPU. However, they were "accelerated" into the older Cortex-A55 and Cortex-A75, which are Arm v8.2 compliant.
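The behavior of one such instruction can be sketched in plain Python. This is an emulation under stated assumptions, not Arm's definition: it assumes signed 8-bit inputs, sixteen elements per source vector, and wraparound on 32-bit overflow. The names sdot_emulated and wrap_int32 are hypothetical.

```python
def wrap_int32(value):
    """Wrap an integer into the signed 32-bit range (assumed overflow behavior)."""
    return (value + 2**31) % 2**32 - 2**31


def sdot_emulated(acc, a, b):
    """Emulate one four-lane dot-product-accumulate step.

    acc: four 32-bit totals (the 128-bit accumulator vector)
    a, b: sixteen signed 8-bit values each (two 128-bit source vectors)
    Lane i adds the dot product of elements 4i..4i+3 to acc[i].
    """
    if len(acc) != 4 or len(a) != 16 or len(b) != 16:
        raise ValueError("expected 4 accumulators and 16 elements per source")
    if any(not -128 <= v <= 127 for v in list(a) + list(b)):
        raise ValueError("source elements must fit in signed 8 bits")
    result = []
    for lane in range(4):
        total = acc[lane]
        for j in range(4):
            total += a[4 * lane + j] * b[4 * lane + j]
        result.append(wrap_int32(total))
    return result


# Sixteen MACs in one step: four lanes, four products per lane.
print(sdot_emulated([0, 0, 0, 0], list(range(1, 17)), [1] * 16))
# [10, 26, 42, 58]
```

Because the accumulator is both an input and the output, a filter longer than four elements can be handled by calling the step repeatedly with the running totals.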
These instructions allow Arm's newest CPUs to deliver impressive neural-network performance. Running at a 2.4GHz clock frequency, for example, Cortex-A76 can achieve 614GOP/s (or 307GMAC/s, counting each MAC as two operations), which is twice the peak performance of a small DLA such as Cadence's P6 or Ceva's XM4.
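The quoted figures can be cross-checked with simple arithmetic. The sketch below assumes one MAC counts as two operations (a multiply and an add) and that the peak rate could be sustained for a whole network, which real workloads are unlikely to manage. The resulting ResNet-50 time is therefore an idealized lower bound, not a measured result.

```python
PEAK_GOPS = 614.0        # peak throughput quoted for Cortex-A76
CLOCK_GHZ = 2.4          # clock frequency quoted for that figure
RESNET50_GMACS = 3.9     # MACs per image quoted for ResNet-50, in billions

peak_gmacs = PEAK_GOPS / 2               # two operations per MAC (assumption)
ops_per_cycle = PEAK_GOPS / CLOCK_GHZ    # operations implied per clock cycle
best_case_ms = RESNET50_GMACS / peak_gmacs * 1000  # idealized time per image

print(f"{peak_gmacs:.0f} GMAC/s")          # 307 GMAC/s
print(f"{ops_per_cycle:.0f} ops/cycle")    # 256 ops/cycle
print(f"{best_case_ms:.1f} ms per image")  # 12.7 ms per image
```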
Subscribers can view the full article in the Microprocessor Report.