Microprocessor Report (MPR)

Ceva and Synopsys Spin More TOPS

IP Vendors Roll Out Multicore Deep-Learning Accelerators

October 7, 2019

By Mike Demler


The rapid adoption of machine learning is driving IP vendors to compete by scaling up performance in each new generation of licensable deep-learning accelerators (DLAs). Last year, DLAs integrating up to 4,096 multiply-accumulate (MAC) units per core set the standard for licensable inference engines. But advanced driver assistance systems (ADASs) and other high-performance edge devices demand even greater performance, so with their latest products, Ceva and Synopsys aim to shatter that mark.

Ceva recently announced its second-generation NeuPro-S. The company didn’t change the core NeuPro engine architecture, which can run all common neural-network layers, but it added a new multilevel memory subsystem (MSS) that allows multiple cores to work together while sharing the same on-chip memory. An engine integrating 4,096 MAC units delivers up to 12.5 trillion operations per second (TOPS). NeuPro-S customers typically aggregate up to eight engines comprising a total of 32K MAC units, delivering up to 100 TOPS, but there is no hard limit. That performance is comparable to that of some data-center DLA chips, but NeuPro targets edge applications.

Along with the new multicore feature, Ceva is developing an API called CDNN-Invite that enables customers to combine their own custom accelerators in a heterogeneous configuration with the NeuPro cores. Several customers have already licensed NeuPro-S, but the company is restricting CDNN-Invite to lead customers, with a general release planned by the end of the year. 

Synopsys has enhanced its DesignWare ARC EV6x DLAs, which allow customers to connect multiple accelerator arrays, each integrating 880 MAC units along with dedicated blocks for activations and fully connected layers. The EV6x models support up to four arrays comprising 3,520 MAC units, but the new EV7x quadruples that to 16 arrays comprising a total of 14,080 MAC units.

Along with supporting more MAC units, the EV7x includes enhancements that improve the engine’s accuracy, bandwidth, and MAC utilization. At a 1.25GHz clock frequency, the EV7x delivers up to 35 TOPS. To match Ceva’s multicore capabilities, customers can connect multiple EV7x instances with shared cluster memory using an AXI bus or a network-on-chip (NoC). In 1Q20, the company plans to release the 14,080-MAC model to lead customers, but the EV7x model with 3,520 MAC units is available for general licensing now.

Something Old and Something New

The NeuPro-S engine is mostly the same as its predecessor (see MPR 1/29/18, “Ceva NeuPro Accelerates Neural Nets”). As Figure 1 shows, it integrates a configurable number of MAC units in a convolution array, adding special-purpose units for activation functions, pooling layers, and scaling. The MAC units execute single-cycle 8x8-bit operations, but the array can also perform 16x8-bit MACs at half that rate or 16x16-bit MACs at a quarter of that rate. Programmers can mix 8-bit and 16-bit quantization, setting the precision independently for each network layer. Like its predecessor, NeuPro-S allows customers to select among 1,024-, 2,048-, and 4,096-MAC configurations for each engine, but the new model omits the 512-MAC option.
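The precision and rate relationships described above lend themselves to a quick cycle estimate. The following Python sketch assumes ideal (100%) MAC-array utilization; the function name, the three-layer network, and its MAC counts are hypothetical and serve only to illustrate how per-layer precision selection affects runtime.

```python
# Relative MAC throughput per precision mode, as described in the text:
# 8x8 at full rate, 16x8 at half rate, 16x16 at a quarter rate.
RATE = {"8x8": 1.0, "16x8": 0.5, "16x16": 0.25}

def layer_cycles(layer_macs, mac_units, mode):
    """Ideal cycle count for one layer (assumes 100% array utilization)."""
    return layer_macs / (mac_units * RATE[mode])

# Hypothetical three-layer network: (MAC operations, precision mode).
layers = [(8_192_000, "8x8"), (4_096_000, "16x8"), (2_048_000, "16x16")]

# Each layer uses the precision chosen for it; the costs simply add up.
total = sum(layer_cycles(macs, 4096, mode) for macs, mode in layers)
print(total)
```

In this made-up example, the 16x16-bit layer performs a quarter of the first layer's MAC operations yet costs the same number of cycles, which is why limiting 16-bit precision to the layers that need it can pay off.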

 

Figure 1. Ceva NeuPro-S deep-learning accelerator. The second-generation architecture allows customers to combine XM6 DSPs with the NeuPro-S engine in multicore configurations. The new CDNN-Invite API enables integration of custom neural-network accelerators that run in the same computational graph with the NeuPro cores.

For its second-generation NeuPro, the company added weight-compression features to its core and the CDNN compiler. During compilation of pretrained networks, the software reduces storage by removing zero-valued weights. It also offers an option that further reduces DRAM bandwidth and storage requirements by employing weight sharing. The company withheld details of its algorithm, but because it likely employs probabilistic rounding to reduce the total number of weights, sharing requires users to trade the bandwidth and storage savings for an unspecified loss in accuracy.

In an example neural network, compressing 8-bit weights to a smaller set of 4-bit weights yields a 2x bandwidth reduction. Sacrificing more accuracy by reducing weights to just 2 bits increases the compression ratio to 3.6x. During runtime, the NeuPro-S engine uses a new weight-decompression manager (WDM) to expand the weights. To optimize the weight quantization, CDNN enables users to retrain the network offline, a capability the first-generation model also supports.
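Because Ceva withheld its algorithm, the following Python sketch shows only a generic form of weight sharing, not the company's method. It assumes a codebook of evenly spaced shared values and index-only storage; all function names are hypothetical. Real compression ratios also depend on zero removal and encoding overhead, which this sketch doesn't model.

```python
def build_codebook(weights, bits):
    """Pick 2**bits shared values, evenly spaced over the weight range."""
    lo, hi = min(weights), max(weights)
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    return [lo + i * step for i in range(levels)]

def compress(weights, codebook):
    """Replace each weight with the index of its nearest shared value."""
    return [min(range(len(codebook)), key=lambda i: abs(codebook[i] - w))
            for w in weights]

def decompress(indices, codebook):
    """Expand indices back to weights, as a runtime decompressor would."""
    return [codebook[i] for i in indices]

def ideal_ratio(n_weights, orig_bits, bits):
    """Storage ratio, counting the codebook itself as overhead."""
    compressed_bits = n_weights * bits + (2 ** bits) * orig_bits
    return n_weights * orig_bits / compressed_bits

weights = list(range(-128, 128))            # every 8-bit value once
codebook = build_codebook(weights, 4)       # 16 shared values
restored = decompress(compress(weights, codebook), codebook)
worst = max(abs(a - b) for a, b in zip(weights, restored))
print(worst, round(ideal_ratio(1_000_000, 8, 4), 2))
```

The worst-case rounding error grows as the codebook shrinks, which is the accuracy cost of sharing; offline retraining is one way to recover some of it.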

For multicore configurations, the new hierarchical on-chip memory reduces off-chip DRAM transactions. As in the first-generation NeuPro, the engine allows configurable level-1 (L1) data and weight memories closely coupled to the core, but the upgraded architecture also supports an optional second-level (L2) memory. The L2 can operate as a cache or SRAM. Connecting the L2 to the AXI bus enables multicore NeuPro-S configurations to share storage.

NeuPro-S increases the bandwidth of multicore configurations by doubling the AXI-bus width to 256 bits. For designs manufactured using a 16nm technology, a single NeuPro-S engine running at 1.5GHz achieves up to 12.5 TOPS, and connecting four dual-core clusters delivers 100 TOPS.
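A back-of-the-envelope check of these figures is straightforward. The Python sketch below assumes the common convention of counting each MAC as two operations per cycle; the function name is hypothetical, and the source doesn't state exactly how Ceva derives its own numbers.

```python
def peak_tops(mac_units, clock_ghz):
    """Peak TOPS, counting each MAC as two operations (multiply and add)."""
    return 2 * mac_units * clock_ghz / 1000.0

single = peak_tops(4096, 1.5)   # one 4,096-MAC engine at 1.5GHz
cluster = 8 * single            # four dual-core clusters = eight engines
print(f"{single:.1f} TOPS per engine, {cluster:.0f} TOPS for eight engines")
```

This simple count lands slightly below the vendor's 12.5 TOPS and 100 TOPS figures, so Ceva's numbers presumably include rounding or contributions this estimate omits.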

Supporting Diversity

The first-generation NeuPro uses a vector-processor unit (VPU) to control DLA operations. It runs an RTOS, and users can program it to handle neural-network layers that the NeuPro engine’s compute units don’t support. The generically named VPU implements an ISA optimized for neural-network processing, but its eight-way VLIW and 14-stage pipeline are the same as those of Ceva’s XM6 DSP (see MPR 10/10/16, “Ceva XM6 Accelerates Neural Nets”), one of the company’s previous-generation inference engines.

For the second-generation model, Ceva reverted to the XM6 branding. The DSP comprises four scalar units and three VPUs, enhancing support for systems that combine DSP functions with neural-network inference. Examples include advanced driver-assistance systems (ADASs) and robots that navigate using computer vision and simultaneous localization and mapping (SLAM) algorithms (see MPR 6/10/19, “DSP-IP Vendors Target 3D Navigation”). The DSP core features 128 MAC units that can run in parallel with the NeuPro engine.

The CDNN-Invite API offers software engineers a unified programming model, and it enables Ceva’s customers to differentiate their designs by building diverse application-specific accelerators. Figure 2 shows the two options for using CDNN-Invite. Customers can combine their custom accelerators with a standalone XM6 DSP, or they can run a custom accelerator in concert with the NeuPro-S engine. The custom accelerator can implement individual neural-network layers, or it can run a completely separate network based on the licensee’s proprietary algorithms. The CDNN compiler optimizes the layers and networks for both configurations, and the scheduler integrates all the cores to form a heterogeneous accelerator.

 

Figure 2. CDNN-Invite usage models. Customers can employ the API to run their own deep-learning accelerators with the XM6 DSP, or they can use it to integrate a custom accelerator with the NeuPro-S engine in a heterogeneous configuration.

A Recurrent Theme

Although Synopsys’s new EV7x can also scale to 100 TOPS, its approach differs from Ceva’s. Whereas NeuPro offers three different MAC-array sizes for each engine, EV7x scales up by integrating multiple engines with fixed 880-MAC arrays, as Figure 3 shows. The predecessor EV6x supports up to 3,520 MAC units (see MPR 8/7/17, “Synopsys EV6x Serves Up Big MACs”), but the new design quadruples that to 14,080 MAC units, equivalent to sixteen 880-MAC arrays. The MAC units retain their predecessor’s 12-bit format, but they also support 8-bit operations at the same speed.
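The array arithmetic is simple to verify. The following Python sketch is illustrative only: the helper names are hypothetical, and the two-operations-per-MAC convention is a common assumption rather than a definition Synopsys has published. It also derives the clock frequency implied by the 35 TOPS the company specifies for the 14,080-MAC configuration.

```python
MACS_PER_ARRAY = 880  # fixed array size stated in the text

def total_macs(arrays):
    """MAC units in a configuration built from fixed-size arrays."""
    return arrays * MACS_PER_ARRAY

def implied_clock_ghz(tops, mac_units):
    """Clock needed to reach a peak-TOPS figure at two ops per MAC per cycle."""
    return tops * 1000.0 / (2 * mac_units)

print(total_macs(4), total_macs(16))
print(round(implied_clock_ghz(35, total_macs(16)), 2))
```

The implied clock comes out just under 1.25GHz, which agrees with the frequency Synopsys quotes once the TOPS figure is rounded.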

 

Figure 3. Synopsys EV7x deep-learning accelerator. The architecture allows designers to combine 1, 2, or 4 vision-processing engines, each including a 32-bit scalar core and 512-bit vector DSP, along with a configurable neural-network accelerator.

The EV7x engine accelerates convolutional neural networks (CNNs), and it offers new capabilities for running batched recurrent neural networks (RNNs), which are often used for voice recognition. It supports more nonlinear activation functions than its predecessor, including parametric rectified linear unit (PReLU), ReLU, sigmoid, tanh, and others. The company withheld details, but it says the EV7x also includes enhancements that improve the engine’s accuracy, bandwidth, and MAC utilization compared with the EV6x.

Designers can couple the EV7x DLA to one, two, or four vision engines, which are similar to NeuPro’s XM6 DSP. The EV7x vision engine integrates a 32-bit scalar CPU along with a 512-bit vector DSP. Each DSP includes 64 INT8 MAC units. As in the XM6, the vision engine can run custom neural-network layers, and it includes ISA and microarchitecture modifications that improve computer-vision and DSP performance compared with its predecessor. The new features include a higher-bandwidth 512-bit DMA unit, faster inner-loop DSP operations, latency reductions, and more scalar and vector registers. The DSP can run SLAM algorithms, and it offers an optional vector FPU (VFPU) that supports half- (FP16) and full-precision (FP32) SIMD operations. Running at 1.0GHz, the VFPU delivers 512 GFLOPS for FP16 operations.

The company supports the EV cores with its MetaWare EV software. The compiler increases performance by employing layer fusion, combining operations that pass data between layers rather than storing intermediate data in SRAM (see MPR 6/17/19, “MediaTek AI Engine Earns Top Marks”). Running Inception v3 at 1.0GHz, the EV7x hardware and software enhancements enable the 3,520-MAC engine to deliver 65% more frames per second (fps) than its predecessor. Larger networks such as Yolo v2 see a smaller improvement, yielding a 16% fps gain. The company specifies that a single EV7x integrating 14,080 MAC units can deliver 35 TOPS, equivalent to a 1.25GHz clock frequency.
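Layer fusion is easiest to see in miniature. The Python sketch below is a generic illustration and not Synopsys's compiler output: it fuses a hypothetical 1-D convolution with a ReLU activation so that the activation is applied to each result before it is written back, instead of first storing a complete intermediate buffer.

```python
def conv1d(x, kernel):
    """Valid 1-D convolution (correlation form) over a list of numbers."""
    k = len(kernel)
    return [sum(x[i + j] * kernel[j] for j in range(k))
            for i in range(len(x) - k + 1)]

def relu(values):
    return [max(0.0, v) for v in values]

def unfused(x, kernel):
    intermediate = conv1d(x, kernel)   # full intermediate result is stored
    return relu(intermediate)

def fused(x, kernel):
    k = len(kernel)
    out = []
    for i in range(len(x) - k + 1):
        acc = sum(x[i + j] * kernel[j] for j in range(k))
        out.append(max(0.0, acc))      # activation applied before write-back
    return out

x = [1.0, -2.0, 3.0, -4.0, 5.0]
kernel = [1.0, 0.5]
print(fused(x, kernel) == unfused(x, kernel))
```

Both versions produce identical outputs; the fused one simply never materializes the intermediate list, which on real hardware corresponds to avoiding a round trip through SRAM.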

As in NeuPro, delivering 100 TOPS or more requires customers to connect multiple instances. Most automotive customers use a NoC with functional-safety features, such as Arteris’ Ncore (see MPR 4/24/17, “Arteris Ncore 2.0 Simplifies Safety”), but customers can also connect multiple EV7x cores using an AXI bus. The EV7x engines integrate closely coupled memories that customers can configure from 4KB to 16MB, but multiple cores connect to shared (L2) cluster memory, and designers can connect an L3 SRAM to the AXI bus.

The EV7x works with an optional AES-XTS encryption engine that protects data communications between the DLA and on-chip memory. Encryption protects sensitive user data such as biometrics as well as the DLA licensee’s neural-network graph topology and weight data. Power-management features include multiple voltage domains and power gating to minimize idle current. The real-time-trace module helps debug instruction execution and data flow, generating messages compliant with the IEEE’s Nexus 5001 Class 3 standard or in Arm’s CoreSight format.

More Similarities than Differences

Ceva and Synopsys each offer DLAs that scale to 100 TOPS or more. But efficiently mapping large neural networks to such a multicore processor is challenging, and such architectures won’t get close to 100% MAC-unit utilization on real networks. EV7x and NeuPro-S include many similar features, as Table 1 shows. When it goes into production, the largest EV7x will include a massive 14K MAC array, but that configuration requires customers to integrate sixteen 880-MAC cores. NeuPro-S doesn’t impose a hard limit on the number of cores customers can include in a multicore configuration, and a single core can include up to 4,096 MAC units. Ceva’s product has a lead of more than two quarters, which enables customers to build higher-performance multicore configurations now.

 

Table 1. Comparison of Ceva and Synopsys DLAs. Both accelerators connect a vector DSP with one or more engines capable of running all common neural-network layers. A single NeuPro-S engine supports up to 4,096 MAC units, but designers can connect multiple engines in multicore clusters. A single EV7x engine scales up by connecting multiple 880-MAC arrays. The largest EV7x integrates 16 arrays comprising 14,080 MAC units. *In a 16nm FinFET process under typical conditions. (Source: vendors)

A single NeuPro-S engine with 4,096 MAC units offers 40% greater peak performance per core than the currently available 3,520-MAC EV74, and we expect it delivers superior area efficiency for popular INT8 inference tasks as well, owing to Synopsys’s use of larger 12-bit MAC units. But TOPS specifications are poor indicators of actual inference-engine performance, so designers should always evaluate DLAs using the vendor’s compiler to run their particular model (see MPR 1/28/19, “AI Benchmarks Remain Immature”).
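The per-core comparison can be reproduced with a short calculation. The Python sketch below takes Ceva's 12.5 TOPS figure as given and estimates the 3,520-MAC EV74's peak under two assumptions that the source doesn't confirm for this configuration: two operations per MAC per cycle, and the same 1.25GHz clock quoted for the largest EV7x. The function name is hypothetical.

```python
def peak_tops(mac_units, clock_ghz):
    """Peak TOPS at two operations per MAC per cycle."""
    return 2 * mac_units * clock_ghz / 1000.0

neupro_s = 12.5                  # vendor figure quoted in the article
ev74 = peak_tops(3520, 1.25)     # assumption: 1.25GHz, as for the 14,080-MAC model
advantage = neupro_s / ev74 - 1.0
print(f"EV74 peak: {ev74:.1f} TOPS, NeuPro-S advantage: {advantage:.0%}")
```

Under those assumptions the advantage works out to roughly 40%, and it would shift with any difference in achievable clock speed between the two designs.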

Driving Toward the Same Goal

By scaling up their inference engines, Synopsys’s EV7x and Ceva’s NeuPro-S both target ADASs and autonomous vehicles. Although self-driving cars can require 100 TOPS (and greater) inference engines, a growing trend for automotive OEMs is to add individual DLAs to each camera, radar, and other sensor; these smaller DLAs are well suited to less ambitious NeuPro-S and EV7x configurations.

Both vendors have announced automotive design wins. Ceva’s licensees include On Semi, which employs NeuPro in its ADAS-camera products. The IP vendor says the first NeuPro-S customers also include an automotive OEM but withheld that customer’s identity. Ceva supports NeuPro with an ISO 26262 safety package, and its products also comply with the Automotive Spice (Software Process Improvement and Capability Determination) and IATF 16949 quality standards.

Synopsys’s automotive customers include Arbe Robotics, a manufacturer of high-resolution radar sensors. Concurrent with the EV7x rollout, Infineon announced plans to employ DesignWare ARC EV IP in its Aurix 32-bit automotive microcontrollers. The Aurix chips are popular in safety-critical ADASs, such as automatic emergency-braking systems. The EV7x cores offer hardware features that enable ASIL B, C, and D certification, including transient-fault protection.

The EDA vendor’s MetaWare EV tools ease development of ISO 26262–compliant software. Synopsys plans to offer functional-safety (FS) enhanced models for a number of its cores, including an EV7xFS version with a hybrid ASIL B/D option. The hybrid feature allows customers to build their chip with an ASIL B model, but after tapeout they can switch to ASIL D–compliant operation, although with decreased performance, through a software change. Synopsys has the advantage of offering a larger ASIL-compliant IP catalog than Ceva, including ASIL B and ASIL D models for its ARC EM22 and HS4x CPUs.

Both companies compete with a long list of DLA-IP vendors (see MPR 1/7/19, “IP Suppliers Push AI Accelerators”). The closest competitor is automotive specialist AImotive, which offers a multicore accelerator called AIware3 (see MPR 11/5/18, “AIware3 Adds Rings to Neural Engine”). But AIware3 requires a host CPU, and to reach 80 TOPS, customers must employ a 7nm technology that can implement eight ring networks connecting multiple compute units. Like the NeuPro engine, Cadence’s DNA 100 combines a 4,096-MAC engine with a Tensilica DSP core (see MPR 10/15/18, “Cadence Mutates Its DNA to Boost AI”), delivering 8 TOPS per core in 16nm technology. For greater performance, licensees can connect multiple cores on an AXI bus, but achieving 100 TOPS requires more than a dozen cores.
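The core counts follow from dividing the target by each vendor's per-core peak and rounding up. The Python sketch below uses the peak figures quoted in this article and a hypothetical helper name; it ignores multicore scaling losses, so real designs may need more cores than this ideal estimate suggests.

```python
import math

def cores_needed(target_tops, tops_per_core):
    """Smallest whole number of cores whose combined peak meets the target."""
    return math.ceil(target_tops / tops_per_core)

# Peak TOPS per core or instance, as quoted in the article.
peaks = {"DNA 100": 8, "NeuPro-S": 12.5, "EV7x (14,080 MACs)": 35}
for name, tops in peaks.items():
    print(name, cores_needed(100, tops))
```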

Although DLA customers can instantiate multiple instances of any licensable core, that approach is unlikely to work if the IP vendor hasn’t verified the configuration, and it requires a compiler capable of partitioning computational graphs across multiple cores. Ceva and Synopsys now offer multicore configurations that push performance to new levels, but neither has published benchmarks for real neural networks. The EV7x and NeuPro-S sport many similar features, so designers must carefully evaluate their designs by using each vendor’s software to benchmark prototypes prior to committing to custom silicon.

Price and Availability

Ceva and Synopsys don’t disclose pricing for their licensable cores. NeuPro-S is available for licensing now, and the CDNN-Invite API will be available for general licensing in 4Q19. More information on NeuPro-S is available at www.ceva-dsp.com/product/ceva-neupro.

The 3,520-MAC EV7x engine is available for licensing now. The 14,080-MAC model will be available for lead customers in 1Q20. For more information on DesignWare ARC EV7x, access www.synopsys.com/dw/ipdir.php?ds=ev7x-vision-processors.
