Microprocessor Report (MPR)

ARM Chooses Variable-Length Vectors

Scalable Vector Extensions Claim Performance, Compatibility for HPC

January 30, 2017

By David Kanter


Pushing further into the server market, ARM has announced the Scalable Vector Extension (SVE), a novel 128- to 2,048-bit vector extension with a vector-length-agnostic programming model. SVE offers compatibility across a range of implementations. Its ambitious goal is to create a “universal” vector extension for the ARMv8-A ecosystem that focuses on high-performance computing (HPC), is orthogonal to the existing 128-bit Neon SIMD, and can compete with Intel’s AVX-512 extensions.

SVE makes complicated tradeoffs and sacrifices to fit in ARM’s fixed-length instruction encoding while allowing variable-length vectors, predication, and vector partitioning. Variable-length vectors with length-agnostic programming date back to the supercomputers of the 1970s. They present a unique challenge for software to exploit efficiently, but they promise value for the ecosystem. ARM wants to enable licensees to differentiate by customizing their own vector implementations yet ensure cross-vendor and forward compatibility. For example, server processors may implement wide vectors whereas mobile processors take a narrower approach.

Longer vector extensions are necessary but insufficient to compete effectively for HPC systems. As Figure 1 illustrates, established vendors such as AMD, IBM, Intel, and Nvidia all rely on fixed-length vector extensions such as AltiVec, AVX, and the CUDA environment to provide massive computational throughput. Using its Neon extensions, ARM lags most of these vendors, particularly Intel, in the width of its vectors and therefore the throughput of its CPUs on vector code. SVE provides ARM licensees with a tool to fix this shortcoming.

Figure 1. Selected vector lengths. CPU vendors employ fixed-length vectors that are between 128 and 512 bits; GPU vendors employ 1,024- to 2,048-bit vectors. SVE is designed for variable lengths from 128 to 2,048 bits.

ARM is finishing the SVE instruction-set architecture (ISA) and application binary interface (ABI) with major partners while developing the software environment. It has promised C/C++ support through its proprietary compiler, along with open-source enablement for LLVM, GCC, and associated tools such as debuggers. ARM’s lead partner for SVE is Fujitsu, which has a rich legacy of HPC experience in the Japanese market and is switching from SPARC to ARMv8-A with SVE for its next supercomputer. We expect other ARM server-processor vendors, such as Cavium and Qualcomm, to adopt SVE to expand their offerings into the HPC market.

ARM Reaches for HPC

ARM’s ambitions in the cloud-server market are well documented (see MPR 1/9/17, “Clouds Open Up for New Processors”). Until recently, however, the company has not targeted the HPC market, even though it’s roughly the same size as that of the leading cloud customers. HPC is also ripe for new processor entrants: AMD’s Opteron processors and Nvidia’s Tesla accelerators both got their start there. Scientific-computing customers are sensitive to performance and power efficiency, and more importantly, many of their applications are portable to new ISAs.

Unfortunately, ARM’s eponymous instruction set is a hindrance in HPC and other markets. The company tightly controls the ISA to prevent fragmentation and to preserve a fully compatible software ecosystem, so its licensees cannot differentiate through the ISA. For example, Cavium’s Octeon uses a custom MIPS variant that’s tailored to networking through its bit-insertion/extraction and population-counting instructions, whereas the company’s ARM-based ThunderX lacks these features. In addition, ARM has avoided ISA extensions that would impose a cost on smartphone processors while benefiting only nascent markets. Although this approach satisfies cost-sensitive licensees such as MediaTek and Allwinner, it limits high-end architectural licensees, especially Apple, Qualcomm, Samsung, and others that can add value.

When ARM announced the 64-bit extensions, an ARMv8-A core could perform up to four double-precision (DP) or eight single-precision (SP) floating-point (FP) operations per cycle using a fused-multiply-accumulate instruction on 128-bit Neon registers. At the time, Intel processors could also perform four DP or eight SP operations per cycle, using separate multiply and add instructions on 256-bit AVX operands. Since then, Intel has quadrupled per-instruction computational throughput by extending x86 with AVX2 and AVX-512, which enable a single instruction to execute 16 DP or 32 SP operations using a multiply-accumulate on 512-bit operands (see MPR 9/21/15, “Knights Landing Reshapes HPC”). This computational disparity all but precludes ARM licensees from competing effectively with Intel for HPC.
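
The arithmetic behind these peak rates is straightforward; the small helper below (our illustration, not vendor code) computes operations per FMA instruction as the lane count times two, since a fused multiply-accumulate performs a multiply and an add in each lane.

    /* Peak FP operations per FMA instruction: lanes x 2 (multiply + add). */
    int fma_ops_per_instr(int vector_bits, int element_bits) {
        return (vector_bits / element_bits) * 2;
        /* fma_ops_per_instr(128, 64) = 4  -> 128-bit Neon, DP
           fma_ops_per_instr(512, 64) = 16 -> 512-bit AVX-512, DP */
    }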

Scalable Vectors Hark Back to Cray

Unlike most other general-purpose ISAs, ARM must satisfy the varied and competing interests of its myriad partners. For example, Cavium intends its ThunderX family for servers and networking, whereas Chinese vendors such as Phytium may receive government incentives to pursue HPC. Wrangling so many different companies to agree on the details of a vector extension is a daunting task. Rather than trying to pick the right implementation (e.g., vector length), ARM opted for a vector extension that supports 128 to 2,048 bits. This scalable design enables it to focus on getting the architecture right while letting licensees choose the implementation that suits their market and performance targets.

Some general-purpose ISAs have received extensions with incrementally more-powerful vector instructions. For example, Intel has augmented x86 with MMX (64-bit MMX registers), SSE (128-bit XMM registers), SSE2, AVX (256-bit YMM registers), AVX2 (FMA), and AVX-512 (512-bit ZMM registers). (See MPR 8/29/11, “AVX2 Refreshes x86 Architecture.”)

This incremental approach, however, consumes a huge number of instruction encodings, complicating the decoders with many redundant versions of the same operation (e.g., integer addition for various register widths as well as for two- and three-operand formats). Although x86 instructions can always tack on another extension byte, the fixed four-byte ARM instruction word lacks the space for such an endeavor. The scalable extension elegantly solves this problem by using a single opcode for each instruction and specifying the vector length through an architectural register.

Conceptually, SVE offers a programming model that’s vector-length agnostic. Traditional vector extensions require programmers to structure their data and algorithms for a particular SIMD configuration (e.g., four wide for single-precision data using Neon). SVE instead uses predication to execute, in each vector iteration, as many scalar-loop iterations as the implementation’s vector length allows. Moreover, SVE theoretically allows code to run on any implementation while using the full vector execution resources of the underlying hardware.
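
To make the model concrete, consider a DAXPY-style loop. The sketch below uses C intrinsics in the style of the arm_sve.h interface ARM has promised but not yet published; the names should be read as our assumptions, not final syntax. The same binary runs correctly at any vector length because WHILELT builds a predicate that bounds the final partial iteration.

    #include <arm_sve.h>
    #include <stdint.h>

    /* y[i] += a * x[i] for i in [0, n), with no vector length fixed at
       compile time. svcntd() returns the number of 64-bit lanes. */
    void daxpy(double *y, const double *x, double a, int64_t n) {
        for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {
            svbool_t pg = svwhilelt_b64_s64(i, n);   // WHILELT bounds the tail
            svfloat64_t vx = svld1_f64(pg, &x[i]);   // predicated loads
            svfloat64_t vy = svld1_f64(pg, &y[i]);
            vy = svmla_n_f64_m(pg, vy, vx, a);       // vy += vx * a, active lanes
            svst1_f64(pg, &y[i], vy);                // inactive lanes untouched
        }
    }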

Register and Predicate Lengths Vary

SVE is a predicated vector extension with destructive operations and advanced features for compiler-driven autovectorization. It primarily targets HPC, supporting a variety of integer and FP data formats but omitting fixed-point and multimedia operations. As Figure 2 illustrates, SVE introduces new architectural state. The Z control registers (ZCRs) are accessible in exception levels 1–3 (EL1–EL3) and describe the vector length (VL) as an integer from 1 to 16; in turn, that integer defines the remaining registers. Applications (EL0) can indirectly read the ZCRs by using a special instruction. The ZCRs can virtualize the execution hardware so different privilege levels appear to have different vector lengths.

Figure 2. SVE architectural state. The extension includes 32 data registers, 17 predicate registers for control flow, and 3 system-control registers. Unlike in other vector extensions, the vector length is implementation dependent and is defined in the control registers.
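
For example, application code can discover the implementation’s vector length at run time. In the planned C interface (intrinsic names assumed here), a counting intrinsic maps to the new length-reading instructions:

    #include <arm_sve.h>
    #include <stdio.h>

    int main(void) {
        /* svcntb() maps to a CNTB/RDVL-style instruction and returns the
           vector length in bytes: 16 for VL = 1 up to 256 for VL = 16. */
        printf("SVE vector length: %llu bits\n",
               (unsigned long long)(svcntb() * 8));
        return 0;
    }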

The vector data registers Z0–Z31 are defined to be 128 bits x VL. For example, if VL is 4, the Z registers are 512 bits wide. The Z registers hold packed (or unpacked) data that can take 8-, 16-, 32-, or 64-bit integer format or single- or double-precision FP format. Thus, the longest vector can hold 256 elements using byte data. Z0–Z31 alias to and extend the existing V0–V31 Neon registers, which ARMv8-A requires.

SVE also contains 17 variable-length predicate registers for vectorization. These registers (P0–P15) and the first-fault register (FFR) are 16 bits x VL, providing one predicate bit for each vector byte. P0–P7 are lane masks that vector instructions reference explicitly; they control which vector lanes are active (and allowed to write back results) or disabled (i.e., predicated off). Disabled vector lanes enable vectorized code to perform small-scale flow control. P8–P15 handle predicate arithmetic and manipulation but seldom control execution.

The predicate registers are coupled to the vector registers to enable mixed-data vectors. As Figure 3 illustrates, only the first predicate bit controls the associated vector lane. In the first example, P0[16] controls Z2[128:191] with 64-bit data, whereas predicate bits P0[17:23] are ignored. When using 32-bit data elements, however, P0[16] controls Z2[128:159] and predicate bit P0[20] controls Z2[160:191].

Figure 3. Predicate- and vector-register organization. Each bit in the predicate register maps to a byte in the vector register. The predicate bit that maps to the first byte in a vector lane enables or disables that lane; other bits are ignored. For example, in the case of 64-bit data elements, only the first of 8 predicate bits controls the vector lane.
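
A small helper (ours, for illustration) captures the mapping: with one predicate bit per vector byte, the governing bit for a lane is the byte offset of the lane’s first byte.

    /* Index of the predicate bit that governs a vector lane. In Figure 3's
       example, lane 2 of 64-bit (8-byte) data is governed by bit 2 * 8 = 16. */
    int governing_bit(int lane, int element_bytes) {
        return lane * element_bytes;
    }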

The predicate registers enable scatter and gather as well as more-sophisticated loop partitioning and speculative loads. They operate with condition codes and instructions that override the default negative, zero, carry, and overflow flags (NZCV), as Table 1 shows. For example, the B.EQ instruction in A64 is interpreted differently when the condition flags are set by an SVE instruction; it tests whether any vector elements are active (i.e., not predicated off).

Table 1. SVE condition flow. The architecture replaces the standard A64 condition codes with new predicate conditions that are suitable for executing multiple vector lanes in parallel. (Source: ARM)
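
At the source level, these predicate-driven conditions surface as tests on predicate registers rather than on scalar counters. The sketch below (same caveat about assumed intrinsic names) exits its loop when no lanes remain active:

    #include <arm_sve.h>
    #include <stdint.h>

    /* x[i] *= a; the svptest_any() exit test corresponds to the new
       "any active element" condition rather than a scalar comparison. */
    void scale(float *x, float a, int64_t n) {
        int64_t i = 0;
        svbool_t pg = svwhilelt_b32_s64(i, n);
        while (svptest_any(svptrue_b32(), pg)) {     // any lanes active?
            svfloat32_t v = svld1_f32(pg, &x[i]);
            svst1_f32(pg, &x[i], svmul_n_f32_m(pg, v, a));
            i += (int64_t)svcntw();                  // 32-bit lanes per vector
            pg = svwhilelt_b32_s64(i, n);
        }
    }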

Fitting SVE in Fixed A64 Encoding

Although a four-byte instruction word is advantageous for instruction decoding, it limits the number of available opcodes. ARM was forced to make a number of clever decisions and sacrifices to deliver SVE. Because SVE is intended for HPC, it’s available only in A64 mode and cannot work with A32 or T32 (Thumb-2) instructions, reducing the number of encodings required. As Figure 4 illustrates, 75% of the A64 encoding space is already used, and ARM managed to fit SVE in just 6.25% of the overall space.

Figure 4. SVE in the A64 opcode space. ARM has already allocated about 75% of the total opcode space (blue). SVE uses about 6.25%, leaving 18.75% for future use.

To maintain a fixed-length encoding, SVE sacrifices flexibility. Each register operand requires 5 bits, and a governing predicate register requires 3 bits. Choosing whether to merge or zero out inactive lanes takes another bit, and then 2 bits specify the element size (8, 16, 32, or 64 bits). This approach brings the total to 21 bits for a three-operand instruction, leaving insufficient room to specify suitable opcodes.
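
Tallying the fields (a restatement of the numbers above, not new information) shows how little room remains in a 32-bit word:

    /* Operand-field budget for a fully general three-operand SVE instruction. */
    enum {
        REG_FIELDS = 3 * 5, /* three 5-bit register specifiers            */
        PRED_FIELD = 3,     /* governing predicate register (P0-P7)       */
        MERGE_ZERO = 1,     /* merge or zero inactive lanes               */
        SIZE_FIELD = 2,     /* element size: 8, 16, 32, or 64 bits        */
        TOTAL = REG_FIELDS + PRED_FIELD + MERGE_ZERO + SIZE_FIELD
    };  /* TOTAL = 21 of 32 bits, leaving only 11 for the opcode itself. */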

Thus, most SVE instructions only specify two operands, with the first source register doubling as the destination register and merging inactive lanes. These restrictions cut down the encoding from 28 bits to a manageable 15 bits. Executing a three- or four-operand function (e.g., FMA or three-operand add), however, requires pairing a new MOVPRFX instruction with the two-operand computational instruction, increasing the instruction count and code size.
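
In source code, the pairing is invisible; the compiler inserts MOVPRFX when a destructive operation’s inputs must survive. The sketch below (assumed intrinsic names; the registers in the comment are illustrative only) shows a case that would typically trigger it:

    #include <arm_sve.h>

    /* A non-destructive FMA at the source level. Because the FMLA encoding
       is destructive (the accumulator is both source and destination), a
       compiler keeping 'acc' live would typically emit a pair such as:
           movprfx z4, z0                 // copy the accumulator first
           fmla    z4.d, p0/m, z1.d, z2.d */
    svfloat64_t fma_keep_acc(svbool_t pg, svfloat64_t acc,
                             svfloat64_t a, svfloat64_t b) {
        return svmla_f64_m(pg, acc, a, b);  // 'acc' itself is not overwritten
    }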

Load and store instructions, which consume about half of the opcode map, include contiguous loads and stores with 8- to 64-bit granularity, as in A64. Naturally, SVE has gather and scatter instructions that operate at both 32- and 64-bit granularity to enable vectorization.

Because the vector length is determined by the hardware implementation and is unknown at compile time, some common functions become quite difficult and must be redefined appropriately. For example, variable-length vector and predicate registers cannot be initialized using compile-time constants (either in memory or in the instruction stream), so ARM created the INDEX and PTRUE instructions to initialize these registers. Similarly, loop counters and spill/fill code implicitly depend on the vector length. The new INCD instruction increments a scalar loop counter by the number of 64-bit elements in the vector, and WHILELT sets a predicate based on the loop count and limit. The ADDVL instruction can increment or decrement the stack pointer by a multiple of VL, and loads and stores can access memory in units of the vector length. The sketch below shows several of these pieces working together.
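
As before, the intrinsic names are our assumption of ARM’s planned C interface: svindex maps to INDEX, the counter update to an INCD-style increment, and svwhilelt to WHILELT.

    #include <arm_sve.h>
    #include <stdint.h>

    /* Fill a[0..n) with 0, 1, 2, ... without knowing VL at compile time. */
    void fill_iota(int64_t *a, int64_t n) {
        svint64_t v = svindex_s64(0, 1);                  // INDEX: {0, 1, 2, ...}
        svint64_t step = svdup_n_s64((int64_t)svcntd());  // 64-bit lanes per vector
        for (int64_t i = 0; i < n; i += (int64_t)svcntd()) {  // INCD-style step
            svbool_t pg = svwhilelt_b64_s64(i, n);        // WHILELT bounds the tail
            svst1_s64(pg, &a[i], v);
            v = svadd_s64_m(svptrue_b64(), v, step);      // advance values by VL
        }
    }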

Partitioning Loops for Speculation

One of SVE’s features is vector partitioning, whose goal is to vectorize loops with dynamic trip counts, such as while() loops that have a data-dependent exit condition. Conceptually, this goal is achieved by predicating a loop to enable speculative vectorization, using the predication to safely clean up any overruns.

For example, suppose a scalar loop will dynamically require 99 iterations over double-precision data elements. In an SVE implementation with VL = 4 (i.e., 512 bits wide), each vector-loop iteration performs eight scalar-loop iterations. The first 12 vector iterations can proceed normally, but the last one must enable only the first three vector lanes and disable the last five. Predication can easily cancel any register side effects in this scenario. (Note that this scenario is a more general case of vector-length-agnostic behavior and that the partitioning must also apply to nested loops.)

To cancel any unwanted memory accesses at the loop’s end, SVE also contains speculative vector loads using the FFR register and associated instructions, as Figure 5 illustrates. The SETFFR instruction initializes the FFR to all true, indicating no faults have occurred. A speculative gather is executed and the first two vector lanes finish, reading A[0] and A[1] into a vector register. The third lane fails when accessing A[2] and sets FFR[2:VL] to false, indicating a problem. The RDFFR instruction then reads FFR and, on the basis of the results, clears P1[0:1] so that the first two lanes are deactivated. The speculative gather is now replayed with A[2] as the first active lane, which then faults and is resolved by the OS. The SVE memory model allows unaligned accesses and is weakly ordered. Speculative accesses to device memory (I/O space) are forbidden and terminate first-fault loads.

Figure 5. Speculative vectorization with first-fault load. The first-fault-load instruction speculatively accesses memory and records faults in the FFR register. The predicate register can disable vector lanes, allowing the code to step through faults serially until all loads are successful.
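
The canonical use is a loop whose memory footprint is unknown in advance. The hedged sketch below (intrinsic names assumed from ARM’s announced direction) counts string bytes without risking a fault past the terminator; it follows the SETFFR/LDFF1/RDFFR sequence of Figure 5 with a contiguous load instead of a gather.

    #include <arm_sve.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Vectorized strlen with first-fault loads: read a full vector
       speculatively, then consult FFR to see how many lanes really loaded. */
    size_t sve_strlen(const char *s) {
        size_t i = 0;
        for (;;) {
            svsetffr();                                  // SETFFR: all lanes clean
            svbool_t pg = svptrue_b8();
            svuint8_t v = svldff1_u8(pg, (const uint8_t *)s + i);  // LDFF1
            svbool_t ok = svrdffr();                     // RDFFR: lanes that loaded
            svbool_t nul = svcmpeq_n_u8(ok, v, 0);       // '\0' among loaded lanes?
            if (svptest_any(ok, nul))                    // terminator found
                return i + svcntp_b8(svptrue_b8(), svbrkb_b_z(ok, nul));
            i += svcntp_b8(svptrue_b8(), ok);            // advance by loaded lanes
        }
    }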

SVE also enables serially executing vector lanes (which ARM calls scalar sub-loops) using the PNEXT instruction, which helps handle loops with carried dependencies; a sketch of the pattern follows below. In theory, scalar sub-loops and speculative loads permit partial loop vectorization (sometimes called loop fission) over sophisticated data structures such as linked lists. But loop fission requires recompiling the code appropriately and may be impractical for more-sophisticated algorithms.
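
The serial pattern, under the same naming caveat, steps through one active lane at a time:

    #include <arm_sve.h>

    /* Visit each active lane of 'pg' serially: a scalar sub-loop. PNEXT
       (svpnext) selects the next active lane after the last one visited. */
    void for_each_lane(svbool_t pg) {
        svbool_t cur = svpnext_b64(pg, svpfalse_b());  // first active lane
        while (svptest_any(pg, cur)) {
            /* ...process the single lane selected by 'cur'... */
            cur = svpnext_b64(pg, cur);                // step to the next lane
        }
    }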

SVE Claims Performance and Compatibility

SVE is a much richer vector extension than the existing 128-bit Neon in ARMv8-A, not to mention potentially much wider. To emphasize the benefits, ARM compared the performance of 128-, 256-, and 512-bit SVE against 128-bit Neon. As the gray bars in Figure 6 show, SVE enables more vectorization than Neon, boosting performance by up to 3x for some workloads even with the same 128-bit vector size. Using longer vectors, the performance advantage is as much as 7x for a 512-bit implementation. But many workloads gain no benefit from SVE relative to Neon, demonstrating that many HPC applications still need high-performance scalar architectures to succeed. Many of the workloads that respond poorly to SVE are also unlikely to permit acceleration by GPUs.

Figure 6. Performance comparison of SVE versus Neon. SVE’s advanced features enable vectorization of more code than Neon, boosting performance for some applications even using the same 128-bit vectors (blue line). Longer vectors can provide further performance gains. (Source: ARM)

Fujitsu, a lead ARM partner, is designing an SVE-compatible chip for a successor to its K supercomputer, once the world’s fastest at 10 petaflop/s. The goal is a massive exaflop/s (1,000 petaflop/s) supercomputer at the Riken Institute in Japan that is expected to be operational in 2022 and is likely to consume 50 megawatts. Fujitsu’s most recent HPC processor is a SPARC V9 design with 256-bit SIMD extensions, scatter/gather, and other enhancements (see MPR 9/22/14, “Sparc64 XIfx Uses Memory Cubes”). The company describes the post-K processor as a 512-bit SVE implementation with a four-operand fused-multiply-add instruction as well as hardware barriers and explicit cache controls.

A big potential benefit of SVE is that it’s theoretically “write once, run anywhere.” Traditional vector extensions require recompiling and sometimes rewriting software to use new instructions or features. In contrast, SVE’s vector-length agnosticism enables the same software to run on 128-, 512-, or even 2,048-bit implementations. In practice, whether software will run well on such different implementations is unclear, but even attempting this sort of forward compatibility is novel for a CPU.

Aiming to Redefine Vectorization and HPC

ARM’s SVE is a unique approach to vectorization that is length agnostic while providing cross-vendor and forward compatibility. The new architecture will place a tremendous burden on compilers to offer both performance and compatibility, which are typically antithetical. But ARM has included various features, such as speculative gather instructions, that could enable compilers to vectorize more applications than in prior explicit-length vector extensions. In theory, SVE allows programmers to write code for a processor with 256-bit vectors and use the same code to efficiently run on a processor with 1,024-bit vectors.

SVE furthers ARM’s server push by enabling partners to compete with Intel and IBM in the HPC market. Fujitsu’s post-K supercomputer will be one of the largest in the world, featuring 100,000–300,000 processors. Although that company’s computer systems are largely designed for the Japanese market, this imprimatur may set the stage for other ARM-based supercomputers in Europe, China, and around the world.

The SVE ecosystem is in its early stages. ARM’s forthcoming commercial Linux-based compiler will enable C/C++ on the architecture. Whether or how the company will support Fortran and other languages is unclear, but Fujitsu likely desires this capability, as scientific applications often employ these languages. To further its HPC presence, ARM recently acquired Allinea, which produces a variety of supercomputing-performance tools. Its 2017 roadmap for SVE includes a public architecture specification, ABI documentation, C/C++ intrinsics, and debuggers, as well as kernel, LLVM, and GCC support.

ARM has packed lots of ambition into a relatively small extension. SVE is a major departure from AltiVec, AVX, Neon, and other explicit-length SIMD-based vector extensions. Assuming software can use it efficiently, SVE could change how computer architects think about vectorization. The variable-length approach gives partners more room to differentiate while offering compatibility across implementations—a necessity for building a robust ecosystem. Most importantly, SVE creates an avenue for ARM to enter HPC, potentially almost doubling its addressable server market. This larger target increases the chance that one or more ARM licensees will carve out a sustainable niche despite Intel’s dominance.

For More Information

ARM presented SVE at Hot Chips 27 (https://goo.gl/4Ce5Fi) and has shared an overview on its web site (https://goo.gl/a7hrL4). Public release of the ISA extensions, ABI, and other information is expected by the end of 1Q17.
