Tachyum Tries for Hyperscale Servers

October 25, 2018

Author: David Kanter

Tachyum is developing a 64-core server processor in 7nm technology for hyperscale data centers, targeting tapeout late next year. The design implements a VLIW instruction set with custom vector and matrix-multiplication instructions as well as a custom fabric.

The Prodigy core is a four-bundle eight-wide design with a short 9-stage integer pipeline and 14-stage floating-point pipeline. It packs four integer units, two vector multiply-accumulate units that are 512 bits wide, a vector permute unit, and three load/store pipelines. It can sustain 192 bytes per clock from the 16KB L1 data cache, which is backed by a 512KB L2 cache. Using 8x8 and 4x4 matrix-multiplication instructions, the core can deliver 1,024 and 512 operations per clock, respectively, for machine learning and HPC.
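The per-clock figures above can be sanity-checked with simple arithmetic. The sketch below is only an illustration and rests on two assumptions the article does not state: that each load/store pipeline moves one full 512-bit vector per clock, and that a multiply-accumulate counts as two operations.

```python
# Figures quoted in the article.
LOAD_STORE_PIPES = 3
VECTOR_WIDTH_BITS = 512

# Assumption (not from the article): each load/store pipeline
# transfers one full-width vector per clock.
bytes_per_pipe = VECTOR_WIDTH_BITS // 8            # 64 bytes
l1_bytes_per_clock = LOAD_STORE_PIPES * bytes_per_pipe


def matmul_ops(n: int) -> int:
    """Operations in one n x n by n x n matrix product.

    Assumption: each multiply-accumulate counts as two operations.
    """
    return 2 * n ** 3


print(l1_bytes_per_clock)  # 192, matching the quoted L1 bandwidth
print(matmul_ops(8))       # 1024, matching the quoted 8x8 figure
```

Under the same counting, a single 4x4 product is only 128 operations, so the quoted 512 operations per clock presumably reflects issue rate or data-type details that the article does not give.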

Although the pipeline is largely in order, the microarchitecture has limited reordering capabilities around load misses, which Tachyum says will provide some of the benefits of out-of-order execution. The startup claims Prodigy should be more efficient than out-of-order designs while delivering similar or better performance.

The Tachyum server processor comprises a mesh fabric that connects 64 compute tiles, each containing a Prodigy core and 512KB of configurable L3 cache. Each tile's L3 operates either as a private victim cache for better locality or as one slice of a distributed shared L3 cache of up to 32MB. For external memory, the processor has eight DDR4/5 channels. The 72 multimode-serdes lanes are typically configured as 64 PCIe 5.0 lanes and two 400G Ethernet links.
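The cache and I/O totals follow from the per-tile numbers. The sketch below works through them, assuming (the article does not say) that the typical configuration uses all 72 serdes lanes and splits the non-PCIe lanes evenly between the two Ethernet links.

```python
# Figures quoted in the article.
TILES = 64
L3_PER_TILE_KB = 512
SERDES_LANES = 72
PCIE_LANES = 64
ETHERNET_LINKS = 2
ETHERNET_LINK_GBPS = 400

# Total L3 if every tile contributes its slice to the shared cache.
total_l3_mb = TILES * L3_PER_TILE_KB // 1024

# Assumption: all lanes are in use and the lanes left after PCIe
# are divided evenly between the Ethernet links.
lanes_per_link = (SERDES_LANES - PCIE_LANES) // ETHERNET_LINKS
gbps_per_lane = ETHERNET_LINK_GBPS // lanes_per_link

print(total_l3_mb)     # 32
print(lanes_per_link)  # 4
print(gbps_per_lane)   # 100
```

On these assumptions, each Ethernet lane would need to run at roughly 100Gbps.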

Tachyum claims the power-efficient Prodigy core and a simple mesh fabric will reach 4.0GHz in 7nm at 0.825V for a total chip power of 180W, even when the vector units are active. Given that the chip has yet to tape out and has specifications similar to those of other processors, these claims may prove optimistic. 
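To put the claimed clock speed and power in perspective, the sketch below derives a theoretical peak and a rough per-tile power budget from the quoted figures. These are not Tachyum's numbers: the calculation assumes every core sustains the 8x8 matrix-multiplication rate on every clock, and it charges the entire 180W to the compute tiles, ignoring memory controllers and serdes.

```python
# Figures quoted in the article.
CORES = 64
CLOCK_HZ = 4_000_000_000       # 4.0GHz target
OPS_PER_CLOCK_PER_CORE = 1024  # 8x8 matrix-multiplication rate
CHIP_POWER_W = 180

# Assumption: all cores sustain the peak matrix rate every clock.
peak_ops_per_second = CORES * CLOCK_HZ * OPS_PER_CLOCK_PER_CORE

# Upper bound per tile: attributes all chip power to the 64 tiles.
watts_per_tile = CHIP_POWER_W / CORES

print(peak_ops_per_second / 1e12)  # about 262 trillion operations per second
print(watts_per_tile)              # 2.8125
```

A budget of less than 3W per tile, covering a 4.0GHz core with two 512-bit vector units plus its cache slice and share of the fabric, illustrates why the power claim looks aggressive.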

