A Scalable NorthPole System with End-to-End Vertical Integration for Low-Latency and Energy-Efficient LLM Inference
Michael V. DeBole, Rathinakumar Appuswamy, Neil McGlohon, Brian Taba, Steven K. Esser, Filipp Akopyan, John V. Arthur, Arnon Amir, Alexander Andreopoulos, Peter J. Carlson, Andrew S. Cassidy, Pallab Datta, Myron D. Flickner, Rajamohan Gandhasri, Guillaume J. Garreau, Megumi Ito, Jennifer L. Klamo, Jeffrey A. Kusnitz, Nathaniel J. McClatchey, Jeffrey L. McKinstry, Tapan K. Nayak, Carlos Ortega Otero, Hartmut Penner, William P. Risk, Jun Sawada, Jay Sivagnaname, Daniel F. Smith, Rafael Sousa, Ignacio Terrizzano, Takanori Ueda, Trent Gray-Donald, David Cox, Dharmendra S. Modha
TL;DR
The paper addresses scalable, low-latency, energy-efficient LLM inference in data centers by introducing a vertically integrated system built on NorthPole accelerators. It combines a pipeline-parallel mapping of LLMs onto 288 NorthPole cards, an end-to-end cloud inference service, and a high-performance software runtime stack, delivering 115 peta-ops at 4-bit integer precision and 3.7 PB/s of memory bandwidth within a 30 kW rack. A key advance is keeping weights and KV caches in on-chip memory to minimize data movement. This is enabled by 8/4/2-bit quantization and SiLQ quantization-aware training, which allow the 8-billion-parameter Granite-3.3-8b-instruct model to achieve bf16-level accuracy. The results demonstrate practical enterprise-scale inference with low latency (2.8 ms per-user inter-token latency) and high throughput (up to ~30k tokens/s per rack). The system also integrates with IBM watsonx Orchestrate for real-world AI workflows, highlighting a viable path for deploying small-to-medium LLMs in cloud or on-prem environments.
Abstract
A vertically integrated, end-to-end, research prototype system combines 288 NorthPole neural inference accelerator cards, offline training algorithms, a high-performance runtime stack, and a containerized inference pipeline to deliver a scalable and efficient cloud inference service. The system delivers 115 peta-ops at 4-bit integer precision and 3.7 PB/s of memory bandwidth across 18 2U servers, while consuming only 30 kW of power and weighing 730 kg in a 0.67 m^2 42U rack footprint. The system can run 3 simultaneous instances of the 8-billion-parameter open-source IBM Granite-3.3-8b-instruct model at 2,048 context length with 28 simultaneous users and a per-user inter-token latency of 2.8 ms. The system is scalable, modular, and reconfigurable, supporting various model sizes and context lengths, and is ideal for deploying agentic workflows for enterprise AI applications in existing data center (cloud, on-prem) environments. For example, the system can support 18 instances of a 3-billion-parameter model or a single instance of a 70-billion-parameter model.
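The headline figures above can be cross-checked with simple arithmetic. The sketch below is illustrative only. It assumes the 28 simultaneous users are per model instance (an assumption that reproduces the ~30k tokens/s figure in the TL;DR), and that cards are divided evenly among servers and among model instances, which the abstract does not state. All constant and function names are hypothetical.

```python
# Back-of-the-envelope checks on the figures quoted in the abstract.
# Assumptions (not stated in the source): 28 users are per model instance,
# and cards are split evenly across servers and across model instances.

CARDS = 288                    # NorthPole cards in the rack
SERVERS = 18                   # 2U servers in the rack
PETA_OPS_INT4 = 115.0          # rack total, 4-bit integer precision
MEM_BW_PB_PER_S = 3.7          # rack total memory bandwidth
RACK_POWER_KW = 30.0           # rack power
INSTANCES_8B = 3               # simultaneous Granite-3.3-8b-instruct instances
USERS_PER_INSTANCE = 28        # assumed to be per instance
INTER_TOKEN_LATENCY_S = 2.8e-3 # per-user inter-token latency


def rack_throughput_tokens_per_s(instances: int, users: int, latency_s: float) -> float:
    """Aggregate decode throughput if every user receives one token per latency period."""
    return instances * users / latency_s


def even_share(total: int, parts: int) -> int:
    """Split `total` items evenly into `parts`; raise if the split is not exact."""
    if total % parts != 0:
        raise ValueError(f"{total} does not divide evenly into {parts} parts")
    return total // parts


throughput = rack_throughput_tokens_per_s(
    INSTANCES_8B, USERS_PER_INSTANCE, INTER_TOKEN_LATENCY_S
)
cards_per_server = even_share(CARDS, SERVERS)

# Cards available to each instance under an even split, for the three
# configurations mentioned in the abstract (3B x 18, 8B x 3, 70B x 1).
cards_per_instance = {
    "3B (18 instances)": even_share(CARDS, 18),
    "8B (3 instances)": even_share(CARDS, INSTANCES_8B),
    "70B (1 instance)": even_share(CARDS, 1),
}

# Rack totals amortized per card (rack power includes the host servers).
tera_ops_per_card = PETA_OPS_INT4 * 1000 / CARDS
mem_bw_tb_per_s_per_card = MEM_BW_PB_PER_S * 1000 / CARDS
watts_per_card = RACK_POWER_KW * 1000 / CARDS

if __name__ == "__main__":
    print(f"rack throughput: {throughput:,.0f} tokens/s")
    print(f"cards per server: {cards_per_server}")
    for name, n in cards_per_instance.items():
        print(f"cards per instance, {name}: {n}")
    print(f"per card: {tera_ops_per_card:.0f} tera-ops, "
          f"{mem_bw_tb_per_s_per_card:.1f} TB/s, {watts_per_card:.0f} W")
```

Under these assumptions, 3 instances × 28 users ÷ 2.8 ms gives about 30,000 tokens/s for the rack, and an even split places one 3-billion-parameter instance on each server's worth of cards, while the 70-billion-parameter model spans the entire rack.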
