RPU -- A Reasoning Processing Unit

Matthew Adiletta, Gu-Yeon Wei, David Brooks

TL;DR

The paper tackles the memory-wall bottleneck in latency-sensitive LLM decoding by proposing the Reasoning Processing Unit (RPU), a chiplet-based system that decouples the memory, compute, and network pipelines and uses Capacity-Optimized High-Bandwidth Memory (HBM-CO). It combines a modular compute fabric with a decoupled microarchitecture and an end-to-end software stack, validated through RTL/SystemC and event-driven simulation. Key findings include up to 45.3× lower latency and 18.6× higher throughput than an H100 system at ISO-TDP on Llama3-405B, as well as substantial energy and cost savings when memory is tuned for bandwidth-per-capacity (BW/Cap) using HBM-CO. The work argues that memory customization and chiplet-based co-design can sustain bandwidth-bound inference with low latency at scale, enabling practical reasoning tasks and interactive AI systems.

Abstract

Large language model (LLM) inference performance is increasingly bottlenecked by the memory wall. While GPUs continue to scale raw compute throughput, they struggle to deliver scalable performance for memory-bandwidth-bound workloads. This challenge is amplified by emerging reasoning LLM applications, where long output sequences, low arithmetic intensity, and tight latency constraints demand significantly higher memory bandwidth. As a result, system utilization drops and energy per inference rises, highlighting the need for a system architecture optimized for scalable memory bandwidth. To address these challenges, we present the Reasoning Processing Unit (RPU), a chiplet-based architecture designed for the modern memory wall. RPU introduces: (1) a Capacity-Optimized High-Bandwidth Memory (HBM-CO) that trades capacity for lower energy and cost; (2) a scalable chiplet architecture featuring a bandwidth-first power and area provisioning design; and (3) a decoupled microarchitecture that separates memory, compute, and communication pipelines to sustain high bandwidth utilization. Simulation results show that RPU achieves up to 45.3x lower latency and 18.6x higher throughput than an H100 system at ISO-TDP on Llama3-405B.
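
To make the memory-wall argument concrete, the sketch below estimates a lower bound on per-token decode latency when every weight must be streamed from memory once per generated token. The parameter count follows from the model name (Llama3-405B); the 2-byte BF16 weight size, the device counts, and the per-device bandwidth are illustrative assumptions, not values reported in the paper, and the function name is hypothetical.

```python
def min_decode_latency_s(num_params, bytes_per_param, num_devices, bw_bytes_per_s):
    """Lower bound on per-token latency for bandwidth-bound decoding.

    Assumes the weights are sharded evenly and each device streams its
    shard exactly once per generated token. KV-cache traffic, compute
    time, communication, and memory capacity limits are all ignored,
    so real latency can only be higher than this bound.
    """
    total_bytes = num_params * bytes_per_param
    aggregate_bw = num_devices * bw_bytes_per_s
    return total_bytes / aggregate_bw


PARAMS = 405e9           # parameter count implied by the model name
BYTES_PER_PARAM = 2      # assumption: BF16 weights
ASSUMED_BW = 3.0e12      # assumption: 3 TB/s per device (illustrative)

for devices in (16, 32, 64):
    latency = min_decode_latency_s(PARAMS, BYTES_PER_PARAM, devices, ASSUMED_BW)
    print(f"{devices:3d} devices: >= {latency * 1e3:.2f} ms/token")
```

Under these assumptions the bound scales inversely with aggregate bandwidth, which is why adding bandwidth, rather than compute, is what shortens per-token latency in this regime.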

Paper Structure (11 sections, 14 figures)

This paper contains 11 sections, 14 figures.

Figures (14)

  • Figure 1: RPU provides higher memory bandwidth than H100, which is required for low-latency decoding. Even at batch sizes up to 32 (BS=32), arithmetic intensity remains low, yet the RPU must execute kernels that straddle the roofline.
  • Figure 2: Power and utilization characterization of H100 using NVML. Left: Power trace during distributed inference (4xH100) of Llama3-70B (Batch=32). Right: Isolated kernel profiling for memory bandwidth utilization across batch sizes and matrix dimensions (BF16).
  • Figure 3: Isolated kernel profiling for power consumption and energy efficiency across batch sizes and matrix dimensions (BF16).
  • Figure 4: Memory technology landscape comparing bandwidth per capacity versus latency per token with 100% capacity utilization for dense LLMs. A technology gap exists in the Goldilocks range for low-latency inference.
  • Figure 5: Tradeoffs in HBM-CO memories, illustrating that high-BW/Cap memories are up to $\sim$2.5x more energy efficient than an HBM3e device, but at $\sim$1.8x higher cost per GB.
  • ...and 9 more figures
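
The roofline language in the Figure 1 caption can be illustrated with a minimal sketch that computes the arithmetic intensity of a single decode-time matrix multiply (a batch of activations times one weight matrix) and compares it with a device's ridge point. The matrix dimensions, the BF16 element size, and the peak compute and bandwidth figures are illustrative assumptions for a hypothetical accelerator, not numbers from the paper; the function names are hypothetical as well.

```python
def gemm_arithmetic_intensity(batch, k, n, bytes_per_elem=2):
    """FLOPs per byte for a (batch x k) @ (k x n) matrix multiply.

    Counts one multiply and one add per inner-product term, and assumes
    the weights, input activations, and outputs each cross the memory
    interface exactly once.
    """
    flops = 2 * batch * k * n
    bytes_moved = bytes_per_elem * (k * n + batch * k + batch * n)
    return flops / bytes_moved


def attainable_flops(intensity, peak_flops, bw_bytes_per_s):
    """Classic roofline: performance is capped by compute or by bandwidth."""
    return min(peak_flops, intensity * bw_bytes_per_s)


PEAK_FLOPS = 1.0e15   # assumption: 1 PFLOP/s dense BF16 (illustrative)
BANDWIDTH = 3.0e12    # assumption: 3 TB/s memory bandwidth (illustrative)
RIDGE = PEAK_FLOPS / BANDWIDTH  # intensity at which the two limits meet

for bs in (1, 8, 32):
    ai = gemm_arithmetic_intensity(bs, 8192, 8192)
    bound = "bandwidth" if ai < RIDGE else "compute"
    perf = attainable_flops(ai, PEAK_FLOPS, BANDWIDTH)
    print(f"BS={bs:2d}: AI={ai:6.2f} FLOP/B, {bound}-bound, {perf / 1e12:.1f} TFLOP/s")
```

For large weight matrices the intensity is close to the batch size, so under these assumed device numbers a batch of 32 still sits well below the ridge point. A design that provisions more bandwidth per unit of compute lowers the ridge point, which moves the same kernels closer to the corner of the roofline.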