Linley Newsletter

Habana Offers Gaudi for AI Training

June 18, 2019

Author: Linley Gwennap

Habana Labs has achieved first silicon of its initial accelerator for neural-network training, which it says outperforms Nvidia’s fastest chip on at least one benchmark. The startup claims the new Gaudi chip will exceed 1,650 images per second (IPS) when training the popular ResNet-50 model. This performance is slightly better than what Nvidia reports for its flagship V100. Habana says Gaudi will use only 140W when running this benchmark, half the V100’s power. If these results hold, Gaudi would be twice as power efficient as both the V100 and its little brother, the Tesla T4.
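The efficiency claim follows from simple arithmetic, sketched below in Python. The 1,650 IPS and 140W figures are Habana's claims as reported above. The article gives no numeric V100 throughput, so the comparison is expressed as ratios: the throughput ratio of 1.0 is a conservative assumption (parity, although Habana claims slightly better), and the power ratio of 0.5 reflects the "half the V100's power" statement. The function names are illustrative only.

```python
# Rough power-efficiency arithmetic for ResNet-50 training.
GAUDI_IPS = 1650     # images/s, Habana's claimed lower bound for Gaudi
GAUDI_WATTS = 140    # W, Habana's stated power on this benchmark


def ips_per_watt(ips: float, watts: float) -> float:
    """Throughput per watt, the efficiency metric used in the comparison."""
    return ips / watts


def efficiency_ratio(throughput_ratio: float, power_ratio: float) -> float:
    """Relative efficiency of chip A over chip B.

    throughput_ratio: A's throughput divided by B's throughput.
    power_ratio:      A's power divided by B's power.
    """
    return throughput_ratio / power_ratio


gaudi_efficiency = ips_per_watt(GAUDI_IPS, GAUDI_WATTS)

# Assumption: throughput parity (1.0) at half the power (0.5).
parity_advantage = efficiency_ratio(1.0, 0.5)

print(f"Gaudi: {gaudi_efficiency:.1f} IPS/W")
print(f"Advantage at throughput parity: {parity_advantage:.1f}x")
```

Any throughput edge over the V100 pushes the ratio above 2x, so "twice as power efficient" is the floor implied by the stated figures rather than an independent measurement.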

The 16nm Gaudi builds on the same basic architecture as Habana’s earlier Goya inference accelerator, which is already in production. Whereas Goya focuses on integer computation, Gaudi fully supports the floating-point formats that most training uses. Gaudi integrates High Bandwidth Memory (HBM2), and to enable large chip clusters, it features 100G Ethernet with remote-DMA (RDMA) capability. For ResNet-50, Habana expects clusters of up to 640 Gaudi chips to deliver near-linear performance scaling. Nvidia, by contrast, sees a severe efficiency drop beyond 16 GPUs.
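To put "near-linear scaling" in concrete terms, the following Python sketch computes the aggregate throughput a 640-chip cluster would reach under perfectly linear scaling, using Habana's claimed per-chip figure of 1,650 IPS. The `scaling_efficiency` parameter and the 0.9 value are hypothetical and serve only to show how a shortfall from linear scaling would reduce the total; Habana has not published an efficiency figure in this article.

```python
# Aggregate ResNet-50 training throughput for a cluster of accelerators.
GAUDI_IPS = 1650  # images/s per chip, Habana's claimed lower bound


def cluster_ips(chips: int, per_chip_ips: float,
                scaling_efficiency: float = 1.0) -> float:
    """Cluster throughput; scaling_efficiency=1.0 means perfectly linear."""
    if chips < 1:
        raise ValueError("chips must be at least 1")
    if not 0.0 < scaling_efficiency <= 1.0:
        raise ValueError("scaling_efficiency must be in (0, 1]")
    return chips * per_chip_ips * scaling_efficiency


# Upper bound for 640 chips under perfectly linear scaling.
ideal_640 = cluster_ips(640, GAUDI_IPS)

# Hypothetical 90% efficiency, for illustration only (not a Habana figure).
example_640 = cluster_ips(640, GAUDI_IPS, scaling_efficiency=0.9)

print(f"Ideal 640-chip throughput: {ideal_640:,.0f} IPS")
print(f"At a hypothetical 90% efficiency: {example_640:,.0f} IPS")
```

Under these assumptions, a full 640-chip cluster tops out at roughly one million images per second, which is the ceiling that near-linear scaling would approach.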

Whereas Habana sells a single Goya-based product—a PCIe accelerator card—it plans to offer three Gaudi form factors. In addition to a 200W PCIe card, Gaudi will come in an OCP-compliant accelerator module that dissipates up to 300W. Facebook originated this Open Compute Project module design, and several chip providers (but not Nvidia) plan to support it. Habana is also developing a rack-mountable system, the HLS-1, that contains eight Gaudi chips and can serve as an element of a large cluster. The company is testing first silicon and expects all three Gaudi products to sample by the end of this year, which should lead to volume production by mid-2020.

