Microprocessor Report (MPR)

General Processor Commands IP Unity

Heterogeneous Platform Includes CNN Accelerator, CPU, and DSP

July 30, 2018

By Mike Demler


General Processor Technologies (GPT), a unique group of affiliated companies based in China and the U.S., has developed a platform of heterogeneous-processor intellectual property (IP) it calls Unity. The platform includes a 64-bit CPU, a configurable deep-learning accelerator (DLA), and a vector DSP; the constituent companies are also developing a general-purpose GPU. GPT plans to begin licensing the CPU, DLA, and DSP as hard and soft cores in 4Q18, after it validates the designs in TSMC’s 28nm process.

The Unity CPU is a superscalar design that runs at up to 2.0GHz in 28nm technology. The Unity DLA integrates 288 multiply-accumulate units (MACs) that work with half-precision floating-point (FP16) data and run at up to 1.0GHz; they handle most convolutional-neural-network (CNN) operations. The variable-length-vector (VLVm1) DSP also runs at 2.0GHz, supporting INT16, INT32, and FP16 data.

The GP8300 image-recognition SoC will serve to validate the IP. GPT withheld architectural details, but the chip will combine the CNN accelerator with the Unity CPU and VLVm1 DSP in a heterogeneous cache-coherent architecture. The Unity 1.0 ISA supports a common programming view for the heterogeneous processor by mapping instructions from the Heterogeneous System Architecture Intermediate Language (HSAIL).

An Inscrutable Organization

Unraveling General Processor’s organization begins with its parent company, Hua Xia (pronounced “wha sha”) GPT, or HXGPT. This company is located in the Beijing Economic and Technological Development Zone (Beijing E-Town), a site the Chinese government established in 1994 to nurture the country’s high-tech industry. Some E-Town companies have attracted foreign investors, such as IBM and Intel. According to its public announcements, Hua Xia initially developed IP specifically for the Chinese market, but in 2016, it announced sampling of the first HSA-compliant Unity platform as well as worldwide expansion of its licensing program. It bills itself as “China’s only IP licensing company providing CPU, DSP, GPU, and AI accelerator cores,” but it has withheld information on the status of those earlier cores. We think this new announcement is a reboot.

Although Hua Xia has for several years employed “GPT” as the name of its U.S. operations, its only U.S.-based business is actually Optimum Semiconductor, the English operating name of Wuxi (pronounced “woo she”) DSP. But the GPT operations in China will run the IP-licensing business while Optimum focuses on SoC design. Optimum traces its origins to Sandbridge, a DSP developer formed in 2001 by a group of IBM Microelectronics engineers (see MPR 11/18/02, “Sandbridge Blasts Off at MPF”). John Glossner, a Sandbridge cofounder, is CEO. Mayan Moudgill, another cofounder, is CTO and chief architect.

In 2008, Sandbridge designed the SB3500 baseband processor using a software-defined architecture to enable dynamic reconfiguration for multiple communications protocols. The chip supports the CDMA, GSM, LTE, and WiMax cellular standards, as well as digital TV, GPS, and Wi-Fi. The company licensed it to Wuxi DSP, also now an HXGPT affiliate, for use in broadband multimedia-trunking systems, network terminals, set-top boxes, and a variety of wireless-communications products. In 2010, after investors dissolved Sandbridge, Wuxi hired most of its engineers, reorganizing the New York–based group as Optimum Semiconductor.

After the reorganization, Optimum designed Wuxi’s SB3500-based systems. It also collaborated with Hua Xia on the Unity IP cores and the GP8300 SoC. Wuxi DSP now describes itself as a chip company, but it continues to sell communications equipment. The Optimum team includes about 20 engineers who provide the industry experience to complement the technical expertise of the 300 engineers at Hua Xia’s Chinese design centers.

Floating a New Architecture

General Processor chose FP16 as the optimal precision for its DLA, rather than the INT8 data type that more commonly serves in CNN inference engines, because FP16 maintains higher accuracy when converting networks trained with FP32 parameters. Half-precision floating point is less challenging than INT8 for designers who lack the neural-network expertise to adapt pretrained models. Because some other AI accelerators implement INT8, the Unity DLA can convert signed and unsigned INT8 data for use with its floating-point hardware.
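The tradeoff is easy to illustrate. This pure-Python sketch uses the `struct` module's IEEE-754 half-precision format to model FP16 storage; the function names are ours, not GPT's. Every INT8 code is exactly representable in FP16, so the input-side conversion is lossless, whereas rounding a trained FP32 weight to FP16 costs only a small error:

```python
import struct

def to_fp16(value: float) -> float:
    """Round a Python float to the nearest FP16 value (IEEE 754 half)."""
    return struct.unpack('e', struct.pack('e', value))[0]

def int8_to_fp16(raw: int, signed: bool = True) -> float:
    """Convert a signed or unsigned INT8 code to FP16, as the DLA does
    before feeding its floating-point filter hardware. All 256 INT8
    values are exactly representable in FP16, so nothing is lost."""
    if signed and raw > 127:
        raw -= 256          # reinterpret the byte as two's complement
    return to_fp16(float(raw))

# FP16 keeps roughly three decimal digits of a trained FP32 weight,
# versus forcing it onto a 256-level grid as INT8 quantization would.
w_fp32 = 0.123456789
w_fp16 = to_fp16(w_fp32)    # ~0.1235, error on the order of 1e-5
```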

As Figure 1 shows, the Unity DLA has four independent engines: input, filter, postprocessing, and output. Each one operates in parallel as a standalone processor with its own program counter. The engines integrate 16KB instruction and data memories along with a 16x32-bit local register file. The instructions are 32 bits wide, comprising the common operations that all the engines perform as well as task-specific CNN-layer operations, such as image-filter functions. Common instructions include add/subtract, control functions (e.g., jump), load/store, logical operations, shift, and register copy.

 

Figure 1. GPT’s deep-learning accelerator. The core includes four engines that can process CNN layers independently and in parallel. The filter engine integrates 288 FP16 MACs, and the postprocessing engine handles activations, pooling, and other non-convolution layers.

The core integrates a 16x32-bit global register file, 16x32-bit shadow registers, and a set of control registers by which the host CPU configures the accelerator. The engines communicate with each other through the global registers. The wait instruction uses the global registers to implement up to eight consumer-producer queues that coordinate data communications between engines.

The Unity DLA interacts with a host CPU through the global and shadow registers. While the accelerator is performing one task, the CPU can load the instructions for the next task in the shadow registers. After completing the current task, the accelerator automatically copies the shadow registers to the global registers, initiating execution of the next task. The chained operation avoids the lost cycles that result from waiting for interrupt-based communications, but the DLA supports that operating mode as well. To determine whether a task is complete, the host can poll the DLA status register. It also has full access to all local registers.
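The shadow-register handoff amounts to double buffering of task state. A minimal sketch of that flow appears below; GPT has not published its programming model, so the class, register layout, and task format here are invented for illustration:

```python
class DlaTaskChainer:
    """Sketch of the shadow/global register handoff: the host stages the
    next task in the shadow registers while the accelerator runs the
    current one; on completion, hardware copies shadow to global and the
    next task starts without an interrupt round-trip. (Names and task
    encoding are illustrative, not GPT's.)"""

    def __init__(self):
        self.global_regs = [0] * 16   # 16x32-bit global register file
        self.shadow_regs = [0] * 16   # staged by the host CPU
        self.log = []                 # record of tasks "executed"

    def host_stage_next(self, task_regs):
        self.shadow_regs = list(task_regs)

    def run_current(self):
        self.log.append(tuple(self.global_regs))

    def complete_and_chain(self):
        # Hardware copies shadow -> global, launching the next task.
        self.global_regs = list(self.shadow_regs)

dla = DlaTaskChainer()
dla.global_regs = [1] + [0] * 15     # task A already loaded
dla.host_stage_next([2] + [0] * 15)  # host stages task B during A
dla.run_current()                    # task A executes
dla.complete_and_chain()
dla.run_current()                    # task B starts with no idle cycles
```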

In TSMC’s 28nm technology, the core runs at 1.0GHz and uses 0.6mm2 of die area. The company integrates up to 2MB of SRAM as well as built-in self-test (BIST), expanding the complete accelerator die to 9.36mm2. Because the DLA core uses only 6% of the total area, GPT concluded that using smaller fixed-point hardware would be an unnecessary compromise. It estimates the power consumption when running at 1.0GHz is less than 250mW.

Memory transactions dominate the accelerator’s performance, as well as its area and power, so the GPT design tightly couples the SRAM to the filter hardware. The SRAM is organized as eight 256KB slices, each associated with four 3x3 filter units, for a total of 32 filter units per engine. The slices comprise 32x16KB banks that are addressable as 16-byte (128-bit) odd/even memories, enabling simultaneous access to 32 data bytes at aligned addresses. The bank organization provides the filter engine with numerous ports, reducing the likelihood of memory-access collisions.
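Assuming a simple linear address decode (GPT has not disclosed the actual mapping), the slice organization can be sketched as follows; a 32-byte aligned access spans one even and one odd 16-byte word, so both halves are fetched in the same cycle:

```python
def decode_sram_address(addr: int):
    """Decode a byte address within one 256KB slice into
    (bank, odd/even half, byte offset), assuming a straightforward
    linear mapping. Each slice holds 32 banks of 16KB; within a bank,
    16-byte words alternate between even and odd memories so two
    aligned words (32 bytes) can be read simultaneously."""
    assert 0 <= addr < 256 * 1024
    bank = addr // (16 * 1024)          # which 16KB bank
    word = (addr % (16 * 1024)) // 16   # 16-byte word within the bank
    half = word % 2                     # even (0) or odd (1) memory
    return bank, half, addr % 16

even = decode_sram_address(0x0000)      # even half of bank 0
odd = decode_sram_address(0x0010)       # adjacent odd half, same cycle
```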

A Set of Specialized Engines

The Unity DLA’s input engine reads data from system memory and stores it in internal SRAM. The output engine does the reverse, writing local data back to system memory. The input and output engines can perform these tasks simultaneously, and if necessary, they convert signed and unsigned INT8 variables to or from the accelerator’s FP16 format. Read/write transactions can specify up to 16 consecutive transfers of up to 16 bytes each.

The filter engine is the CNN workhorse, executing matrix-multiplication dot products on nine FP16 input values. Each filter unit comprises a 3x3 MAC array that multiplies the data in register “x” (xJ) by the associated weight in register “a” (aJ), summing the nine products in register “y.” To implement larger filters, the accumulator can add partial sums from other function units. The designers chose a fixed 3x3 filter for the initial implementation, since it’s a popular size in open-source CNNs, but the filters are configurable. Larger filters require bigger input-data and weight registers, as well as wider memory fetches; smaller filters require masking of unused elements. To increase accuracy, the accumulator stores intermediate results with greater precision than FP16, but the engine converts them back to that format for output.
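Functionally, one filter-unit step reduces to a nine-element dot product with wide accumulation. In this sketch, Python floats stand in for the wider internal accumulator, and `struct`-based rounding models FP16 storage; the function names are illustrative, not GPT's:

```python
import struct

def fp16(x: float) -> float:
    """Round to the nearest FP16 value using struct's IEEE-half format."""
    return struct.unpack('e', struct.pack('e', x))[0]

def filter3x3(x, a, partial=0.0):
    """One filter-unit step: multiply nine FP16 inputs (x) by nine FP16
    weights (a), accumulating at higher precision (Python floats model
    the wider accumulator). An optional partial sum from another unit
    supports larger filters. The result is rounded back to FP16 for
    output, as the engine does."""
    acc = partial
    for xi, ai in zip(x, a):
        acc += fp16(xi) * fp16(ai)   # products accumulate at full width
    return fp16(acc)                  # converted back to FP16 on output

x = [0.5] * 9                         # nine input activations
a = [0.25] * 9                        # nine filter weights
y = filter3x3(x, a)                   # 9 * 0.5 * 0.25 = 1.125
```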

CNNs often apply the same weights to multiple channels. Since the Unity DLA associates each 256KB SRAM slice with groups of four 3x3 filters, it can easily load the weights and data in parallel to the four sets of filter registers. The load instruction has a shift option that implements the commonly used stride-by-two function. To handle CNN layers of greater depth, the architecture also enables combining the outputs from the eight SRAM/filter slices. Each slice produces four results; the accelerator can write them to memory as 32 separate FP32 values or add them from each of the eight slices to produce four sums.

In image-processing CNNs, the input pixel array is seldom an integer multiple of the filter size and stride used in the convolutions. To avoid a loss of edge detail in the filter-generated feature maps, programmers must pad the input data by creating a larger array with zeros in the border rows and columns. The Unity DLA filter engines perform this padding automatically, using a mask to load zeros at any position during the convolution.
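In software, the equivalent padding step looks like the minimal sketch below; the DLA performs it in hardware through its load mask, with no extra memory traffic. A pad of one lets a 3x3 filter produce a same-size feature map:

```python
def pad_input(image, pad=1):
    """Zero-pad a 2-D input array on all sides, as a programmer would
    otherwise do before convolution to preserve edge detail. The Unity
    DLA instead masks in zeros at the borders during the convolution."""
    h, w = len(image), len(image[0])
    out = [[0.0] * (w + 2 * pad) for _ in range(h + 2 * pad)]
    for r in range(h):
        for c in range(w):
            out[r + pad][c + pad] = image[r][c]
    return out

img = [[1.0, 2.0], [3.0, 4.0]]
padded = pad_input(img)   # 4x4 array with the 2x2 image centered
```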

The postprocessing engine handles non-convolution-layer operations, including activation, decimation, pooling, and sorting. As Figure 2 shows, the activation functions are dedicated hardware blocks that perform the common rectified-linear-unit (ReLU), sigmoid, and hyperbolic-tangent (tanh) functions. The hardware approximates these functions to two-bit precision. The max-pooling function selects the maximum value from four, six, or eight inputs. A multiplexer enables another comparison that passes the maximum of the input or a value stored in the internal register. The sort operation handles functions such as top-N ranking of the output from the final fully connected layer.
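Functionally, these postprocessing operations reduce to a handful of simple primitives, sketched here in exact arithmetic (the hardware approximates the activation functions rather than computing them exactly, and the function names are ours):

```python
import math

def relu(x):
    """Rectified linear unit: clamp negatives to zero."""
    return max(0.0, x)

def sigmoid(x):
    """Logistic activation; the DLA uses a hardware approximation."""
    return 1.0 / (1.0 + math.exp(-x))

def max_pool(inputs, floor=None):
    """Select the maximum of 4, 6, or 8 inputs; the optional 'floor'
    models the mux that also compares against a value held in an
    internal register."""
    m = max(inputs)
    return m if floor is None else max(m, floor)

def top_n(scores, n=5):
    """Top-N ranking of the final fully connected layer's output:
    the n highest (class_index, score) pairs."""
    ranked = sorted(enumerate(scores), key=lambda p: p[1], reverse=True)
    return ranked[:n]
```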

 

Figure 2. Unity DLA postprocessing engine. Hardware accelerators handle non-convolution-layer functions, including activation, pooling, and sorting of the output from the fully connected layer.

Divide and Conquer

The parallelism of its engines allows the Unity DLA to deliver high performance and area efficiency. Each of the eight SRAM slices in the filter engine simultaneously supports four 3x3 matrix multiplies and accumulations for a total of 576 operations per cycle. The filter engine works in parallel with the input, output, and postprocessing engines.
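The per-cycle figure follows directly from the organization described above:

```python
SLICES = 8             # SRAM/filter slices in the filter engine
FILTERS_PER_SLICE = 4  # 3x3 filter units per slice
MACS_PER_FILTER = 9    # a 3x3 filter performs nine multiply-accumulates

macs = SLICES * FILTERS_PER_SLICE * MACS_PER_FILTER  # 288 FP16 MACs
ops_per_cycle = macs * 2   # each MAC counts as a multiply plus an add

# At 1.0GHz, 576 operations per cycle works out to 576 GOPS peak.
```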

The Unity DLA employs a data-flow architecture similar to Ceva’s NeuPro Engine but with greater parallelism (see MPR 1/29/18, “Ceva NeuPro Accelerates Neural Nets”). Both accelerators have dedicated hardware blocks for the convolution and non-convolution layers. To handle new layer functions not baked into the architecture, Ceva couples NeuPro with a vector DSP, as we expect Optimum will also do in its GP8300 processor. But whereas NeuPro only supports INT8 and INT16 data types, the Unity DLA supports the higher-precision FP16. The floating-point MACs consume more power than fixed-point MACs, but GPT uses tightly coupled SRAM slices that minimize power when moving data between engines.

The company tested its DLA using SqueezeNet, a 10-layer image-recognition CNN that UC Berkeley developed as a more efficient alternative to AlexNet. When running the model on one 1.0GHz core using all eight filter-engine slices, the Unity DLA can classify 42 images per second. It beats the Kirin 970 neural-network engine, which uses a heterogeneous architecture combining the Huawei CPU/GPU with a Cambricon accelerator to recognize 33 images per second (see MPR 1/22/18, “Neural Engines Rev Up Mobile AI”). GPT estimates a single 28nm Unity DLA core can execute more than 10 trillion operations per second (TOPS) per watt.

That power efficiency is more than 2x better than what Nvidia’s NVDLA inference engine can deliver, according to that GPU vendor’s estimate for a 256-MAC model in 16nm technology (see MPR 3/26/18, “Nvidia Shares Its Deep Learning”). The Unity DLA is also much smaller, occupying 0.6mm2 in 28nm versus 1.7mm2 for the NVDLA in the same process. Both die-area estimates exclude on-chip memory. Although the NVDLA is a fixed-point accelerator, we expect its larger die area is the result of its more complex ISA and a more extensive set of algorithm options.

A Proprietary CPU Maintains Control

GPT’s DLA connects to a host CPU and external DRAM through its AXI-bus interface; the host is necessary to control the CNN-operation sequence. The company developed a proprietary ISA for the Unity CPU it employs in the GP8300 SoC, which it also plans to license separately. It intends to offer the CPU both as a synthesizable core and in a TSMC 28nm hard macro. The design allows cache-coherent configurations of up to four cores.

The CPU implements an out-of-order execution pipeline. Figure 3 shows an earlier version, which lacks some of the newer design’s modifications. It’s a superscalar design, issuing up to three instructions in parallel to the address, branch, floating-point, integer, and memory execution units. The core distributes its registers into separate address, branch-jump-target, general-purpose, and floating-point files. To boost instruction throughput and simplify the logic, it distributes register renaming to each functional unit, supporting up to 63 physical registers per unit. The branch predictor is a 2K-entry design with 2-bit saturating counters.

 

Figure 3. Unity CPU. GPT withheld details, but its published literature shows a three-issue superscalar design that can run at up to 2.0GHz in TSMC’s 28nm process.

The core integrates a direct-mapped 8KB L0 cache to store prefetched instructions. The L1 I/D caches are 32KB each and four-way set associative. The L2 cache is 2MB and eight-way associative; designers can dynamically configure it as a unified cache or divide it into a 1MB L2 and a separate 1MB tightly coupled memory. The Unity CPU implements three privilege levels, and it supports hypervisors and virtualization. It works with an MMU to run a Linux operating system.

Unifying the Platform

The goal of the Unity IP is to supply a family of processor cores that rival offerings from Arm, Ceva, and other IP vendors. GPT has disclosed few details, but its CPU appears to be a Cortex-A72-class design, sans Arm Neon–like SIMD extensions. The VLVm1 DSP fills that gap: it’s a dynamically configurable core supporting up to 64KB vectors. The Unity DLA is a high-performance power-efficient design that must compete with best-in-class licensable cores such as the Ceva NeuPro. GPT developed the initial 288-MAC DLA for TSMC’s 28nm technology, but that design scales to eight cores for a total of 2,304 half-precision floating-point MACs. By comparison, NeuPro integrates a maximum of 1,024 INT16 MACs. Whereas NeuPro is the sixth-generation Ceva design, the Unity cores are entirely unproven.

GPT designed the Unity platform as a complete heterogeneous processing architecture. It plans to support it with a GCC-based compiler, Caffe and TensorFlow tools, computer-vision APIs, and a CNN-model library. Developing a complete software stack for such a new heterogeneous architecture can take several years, but the company expects to minimize the effort by supporting an HSAIL run-time API and the HSA Finalizer, which converts HSAIL executables to the Unity 1.0 ISA (see MPR 6/24/13, “HSA Gets a Sense of HUMA”). In 2015, Optimum’s CEO became the HSA Foundation president, and to promote HSA adoption in China, the foundation last year set up a group comprising 20 Chinese universities, led by a Hua Xia researcher. We expect the former Sandbridge team has developed the Unity software tools, applying the heterogeneous-processing experience it gained from its DSPs.

The Unity IP roadmap is ambitious. The funding source for this new business is unclear, but the Chinese government has committed billions of dollars to developing a domestic AI industry (see MPR 4/30/18, “China’s AI Dream”), so it’s probably involved in Hua Xia. Companies such as Cambricon are also pursuing dual business models, launching IP cores in parallel with SoCs based on those cores (see MPR 3/5/18, “Cambricon Leads China Into AI Chips”). In that company’s case, Huawei provided a big boost by signing up as an early customer, but it uses the IP along with Arm CPU and GPU cores. GPT will have a much harder time convincing an established SoC developer to adopt its entire processor-IP platform. It can increase its chance of success by focusing first on the DLA, which addresses a wide-open field.

Price and Availability

General Processor Technologies withheld details of its pricing model. It plans to offer the Unity cores for general availability in 4Q18. For more information on the Unity platform, access www.generalprocessortech.com/solutions.
