L3: DIMM-PIM Integrated Architecture and Coordination for Scalable Long-Context LLM Inference

Qingyuan Liu, Liyan Chen, Yanning Yang, Haocheng Wang, Dong Du, Zhigang Mao, Naifeng Jing, Yubin Xia, Haibo Chen

TL;DR

L3 tackles the memory bottlenecks of long-context LLM inference by offloading decoding MHA and KV caches to DIMM-PIM, creating scalable capacity and bandwidth far beyond GPU memory limits. It couples zero-latency in-flight re-layout, KV cache mapping strategies, and bubble-free kernel fusion with a dependency-aware cross-device scheduler to hide communication and balance work between GPUs and DIMM-PIM. The system demonstrates up to 6.1× throughput gains and up to 14.3× larger batch sizes across multiple models and traces, while maintaining low time-between-tokens and manageable hardware overhead. This work establishes a practical path toward high-throughput, long-context LLM inference in data-center environments by leveraging host memory DIMM-PIM as a scalable co-processor for decoding MHA.

Abstract

Large Language Models (LLMs) increasingly need to process long text sequences, but GPU memory limitations force difficult trade-offs between memory capacity and bandwidth. HBM-based acceleration offers high bandwidth, but its capacity remains constrained. Offloading data to host-side DIMMs improves capacity but introduces costly data-swapping overhead. We identify that the critical memory bottleneck lies exclusively in the decoding phase of multi-head attention (MHA), which demands substantial capacity for storing KV caches and high bandwidth for attention computation. Our key insight is that this operation aligns uniquely with modern DIMM-based processing-in-memory (PIM) architectures, which offer scalability in both capacity and bandwidth. Based on this observation and insight, we propose L3, a hardware-software co-designed system that integrates DIMM-PIM and GPU devices. L3 introduces three innovations. First, hardware redesigns resolve data layout mismatches and computational element mismatches in DIMM-PIM, improving utilization for LLM inference. Second, communication optimization hides data transfer overhead behind computation. Third, an adaptive scheduler coordinates GPU and DIMM-PIM operations to maximize parallelism between devices. Evaluations using real-world traces show that L3 achieves up to 6.1× speedup over state-of-the-art HBM-PIM solutions while significantly increasing batch sizes.
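To make the capacity and bandwidth pressure of decoding MHA concrete, the following back-of-the-envelope sketch estimates the KV cache footprint per token and compares the arithmetic intensity (FLOPs per byte read) of decoding attention with that of a batched FC layer. It is not taken from the paper: the function names are hypothetical, and the example configuration (32 layers, 32 heads, head dimension 128, FP16) is an assumed Llama-7B-like setup. The model is deliberately simplified: it counts only KV cache or weight reads and ignores activations, softmax, and caching effects.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token adds across all layers (K and V tensors)."""
    # Each layer stores one K vector and one V vector per KV head.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def decode_attention_intensity(context_len, head_dim, bytes_per_elem=2):
    """FLOPs per byte for one head decoding one token against its KV cache."""
    # QK^T and AV each take context_len * head_dim multiply-adds (2 FLOPs each).
    flops = 4 * context_len * head_dim
    # The K and V entries of this request are read once and shared with no other request.
    bytes_read = 2 * context_len * head_dim * bytes_per_elem
    return flops / bytes_read


def fc_intensity(batch_size, d_in, d_out, bytes_per_elem=2):
    """FLOPs per byte for a batched FC layer, counting weight reads only."""
    flops = 2 * batch_size * d_in * d_out
    # Weights are read once and reused by every request in the batch.
    bytes_read = d_in * d_out * bytes_per_elem
    return flops / bytes_read


if __name__ == "__main__":
    per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
    print(f"KV cache per token: {per_token / 2**20:.2f} MiB")
    print(f"Decode attention intensity: {decode_attention_intensity(1024, 128):.1f} FLOPs/byte")
    print(f"FC intensity at batch 64: {fc_intensity(64, 4096, 4096):.1f} FLOPs/byte")
```

Under these assumptions, the intensity of decoding attention stays constant regardless of context length or batch size, while the FC intensity grows linearly with batch size. This is consistent with the abstract's claim that decoding MHA needs both capacity and bandwidth, whereas FC layers benefit from batching.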

Paper Structure

This paper contains 26 sections, 2 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: LLM inference process. The FC operations for each request can be batched and are compute-intensive, whereas MHA cannot be batched and is memory bandwidth-intensive.
  • Figure 2: GPU memory bottlenecks with Llama-7B on A100. Capacity constrains the batch size, while bandwidth constrains the time-between-tokens. (a) profiles GPU utilization during the "Feed-forward" operation. (b) normalizes the decoding latency at 1K token length to 1. The inference system is S-LoRA (Sheng et al., 2023).
  • Figure 3: Challenges of DIMM-PIM integration.
  • Figure 4: L3 overview.
  • Figure 5: DIMM-PIM architecture for LLM attention.
  • ...and 6 more figures