A Scalable NorthPole System with End-to-End Vertical Integration for Low-Latency and Energy-Efficient LLM Inference

Michael V. DeBole, Rathinakumar Appuswamy, Neil McGlohon, Brian Taba, Steven K. Esser, Filipp Akopyan, John V. Arthur, Arnon Amir, Alexander Andreopoulos, Peter J. Carlson, Andrew S. Cassidy, Pallab Datta, Myron D. Flickner, Rajamohan Gandhasri, Guillaume J. Garreau, Megumi Ito, Jennifer L. Klamo, Jeffrey A. Kusnitz, Nathaniel J. McClatchey, Jeffrey L. McKinstry, Tapan K. Nayak, Carlos Ortega Otero, Hartmut Penner, William P. Risk, Jun Sawada, Jay Sivagnaname, Daniel F. Smith, Rafael Sousa, Ignacio Terrizzano, Takanori Ueda, Trent Gray-Donald, David Cox, Dharmendra S. Modha

TL;DR

The paper addresses scalable, low-latency, energy-efficient LLM inference in data centers with a vertically integrated NorthPole-based system. The system combines a pipeline-parallel mapping of LLMs onto 288 NorthPole accelerator cards, an end-to-end cloud inference service, and a high-performance software runtime stack, delivering 115 peta-ops at 4-bit integer precision and 3.7 PB/s of memory bandwidth in a 30 kW rack. A key design choice is keeping weights and KV caches in on-chip memory to minimize data movement; this is enabled by 8/4/2-bit quantization and SiLQ quantization-aware training, which lets the 8-billion-parameter Granite-3.3-8b-instruct model reach bf16-level accuracy. The results show enterprise-scale inference with a per-user inter-token latency of 2.8 ms and throughput of up to ~30k tokens/s per rack. The system integrates with IBM watsonx Orchestrate for real-world AI workflows, offering a viable path for deploying small-to-medium LLMs in cloud or on-prem environments.

Abstract

A vertically integrated, end-to-end, research prototype system combines 288 NorthPole neural inference accelerator cards, offline training algorithms, a high-performance runtime stack, and a containerized inference pipeline to deliver a scalable and efficient cloud inference service. The system delivers 115 peta-ops at 4-bit integer precision and 3.7 PB/s of memory bandwidth across 18 2U servers, while consuming only 30 kW of power and weighing 730 kg in a 0.67 m^2 42U rack footprint. The system can run 3 simultaneous instances of the 8-billion-parameter open-source IBM Granite-3.3-8b-instruct model at 2,048 context length with 28 simultaneous users and a per-user inter-token latency of 2.8 ms. The system is scalable, modular, and reconfigurable, supporting various model sizes and context lengths, and is ideal for deploying agentic workflows for enterprise AI applications in existing data center (cloud, on-prem) environments. For example, the system can support 18 instances of a 3-billion-parameter model or a single instance of a 70-billion-parameter model.
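
The headline numbers in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is a back-of-envelope check, not a performance model from the paper. It assumes (these are assumptions, not statements from the source) that the 28 simultaneous users are per model instance and that, in steady state, each user receives one token per inter-token latency. All variable and function names are illustrative.

```python
import math

# Figures quoted in the abstract.
CARDS = 288
SERVERS = 18
PETA_OPS_INT4 = 115.0          # rack total at 4-bit integer precision
MEM_BW_PB_PER_S = 3.7          # rack total memory bandwidth
INSTANCES = 3                  # Granite-3.3-8b-instruct instances per rack
INTER_TOKEN_LATENCY_S = 2.8e-3

# Assumption: the 28 simultaneous users are per model instance.
USERS_PER_INSTANCE = 28


def per_card_share(rack_total: float, cards: int = CARDS) -> float:
    """Average share of a rack-level total attributed to one card."""
    return rack_total / cards


def steady_state_tokens_per_s(users: int, latency_s: float) -> float:
    """Assumption: every user receives one token per inter-token latency."""
    return users / latency_s


cards_per_server = CARDS // SERVERS                                  # 16
tera_ops_per_card = per_card_share(PETA_OPS_INT4 * 1000)             # tera-ops
mem_bw_tb_per_s_per_card = per_card_share(MEM_BW_PB_PER_S * 1000)    # TB/s
instance_tokens_per_s = steady_state_tokens_per_s(
    USERS_PER_INSTANCE, INTER_TOKEN_LATENCY_S
)
rack_tokens_per_s = INSTANCES * instance_tokens_per_s

print(f"cards per server:       {cards_per_server}")
print(f"tera-ops per card:      {tera_ops_per_card:.1f}")
print(f"memory BW per card:     {mem_bw_tb_per_s_per_card:.2f} TB/s")
print(f"tokens/s per instance:  {instance_tokens_per_s:.0f}")
print(f"tokens/s per rack:      {rack_tokens_per_s:.0f}")
```

Under these assumptions the rack-level figure comes out at roughly 30,000 tokens/s, which agrees with the "~30k tokens/s per rack" quoted in the TL;DR. The per-card values are plain averages of the rack totals and say nothing about how load is actually distributed.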

Paper Structure

This paper contains 23 sections, 6 figures, 2 tables.

Figures (6)

  • Figure 1: End-to-end view of the NorthPole LLM inference system. The NorthPole chip (upper left) is a highly parallel neural inference accelerator deployed in a PCIe form factor by a NorthPole card (middle left), 16 of which are hosted by a 2U server (lower left) to form a NorthPole LLM server node (lower right), up to 18 of which make up a NorthPole LLM inference rack. Each rack can run multiple LLM instances simultaneously at high throughput and low latency. Larger models can be deployed by connecting multiple racks. Access to the NorthPole-accelerated models is provided via the IBM watsonx web interface and API.
  • Figure 2: Mapping an 8-billion-parameter LLM to NorthPole. The attention and multi-layer perceptron (MLP) blocks of each of the 40 transformer layers of the Granite-3.3-8b-instruct model (left) are mapped to separate NorthPole cards (lower right) using pipeline parallelism. The output layer is split across 4 NorthPole cards using tensor parallelism. The full model uses 84 cards in 6 NorthPole LLM server nodes that are interconnected via 200 GbE and occupy 12U of a NorthPole LLM inference rack (upper right).
  • Figure 3: Mapping a 20-billion-parameter LLM to NorthPole. The attention and expert blocks of each of the 24 transformer/MoE layers of the gpt-oss-20b model are mapped to separate NorthPole cards using tensor and pipeline parallelism. The full model uses 104 cards in 7 NorthPole LLM server nodes. The 120-billion-parameter gpt-oss-120b model (not shown) can likewise be mapped using 11 cards for experts in each of the 36 layers, for a total of 440 cards in 28 server nodes across 2 inference racks.
  • Figure 4: NorthPole LLM inference service. Each LLM instance runs on its own pipeline-parallel chain of one or more NorthPole LLM server nodes. Each server node hosts a NorthPole application container that controls, configures, and communicates with its NorthPole cards. The first server node in the chain hosts two additional containers: a pipeline management container to handle input and output for the server node pipeline, and a sequence head container for pre- and postprocessing tasks like tokenization and interacting with the cloud services (left) that connect the user with the NorthPole LLM inference system (right).
  • Figure 5: Accuracy of the quantized (A8-C8-W4) Granite-3.3-8b-instruct model when run on NorthPole, compared with the original bfloat16 (bf16) model, on 19 benchmarks, including common-sense reasoning tasks and tasks from versions 1 and 2 of the Open LLM Leaderboard.
  • ...and 1 more figure
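
The card, node, and rack counts in the captions for Figures 1 to 3 follow from simple packing arithmetic, sketched below. The constants come from the captions (16 cards per 2U node, up to 18 nodes per rack). The breakdown of the Granite model into one card per attention block and one per MLP block is an inference from "mapped to separate NorthPole cards" that happens to reproduce the stated total of 84; the function names and the whole-node-per-instance assumption are illustrative, not taken from the paper.

```python
import math

# From the Figure 1 caption.
CARDS_PER_NODE = 16
NODES_PER_RACK = 18
NODE_HEIGHT_U = 2


def nodes_needed(cards: int) -> int:
    """Server nodes required to host a given number of cards."""
    return math.ceil(cards / CARDS_PER_NODE)


def racks_needed(nodes: int) -> int:
    """Racks required to host a given number of server nodes."""
    return math.ceil(nodes / NODES_PER_RACK)


# Granite-3.3-8b-instruct (Figure 2): 40 layers, with attention and MLP
# blocks on separate cards (assumed one card each), plus 4 output-layer cards.
granite_cards = 40 * 2 + 4
granite_nodes = nodes_needed(granite_cards)
granite_rack_units = granite_nodes * NODE_HEIGHT_U

# Assumption: each instance occupies whole nodes, so a full rack holds
# floor(18 / nodes_per_instance) instances.
granite_instances_per_rack = NODES_PER_RACK // granite_nodes

# gpt-oss-20b and gpt-oss-120b (Figure 3): card totals as stated in the caption.
gpt_oss_20b_nodes = nodes_needed(104)
gpt_oss_120b_nodes = nodes_needed(440)
gpt_oss_120b_racks = racks_needed(gpt_oss_120b_nodes)

print(f"Granite 8B:   {granite_cards} cards, {granite_nodes} nodes, "
      f"{granite_rack_units}U, {granite_instances_per_rack} instances per rack")
print(f"gpt-oss-20b:  104 cards, {gpt_oss_20b_nodes} nodes")
print(f"gpt-oss-120b: 440 cards, {gpt_oss_120b_nodes} nodes, "
      f"{gpt_oss_120b_racks} racks")
```

The results match the captions: 84 cards in 6 nodes occupying 12U for Granite, 7 nodes for gpt-oss-20b, and 28 nodes across 2 racks for gpt-oss-120b. Three 6-node Granite instances fill exactly 18 nodes, which is consistent with the 3 simultaneous instances reported in the abstract.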