Sangam: Chiplet-Based DRAM-PIM Accelerator with CXL Integration for LLM Inferencing

Khyati Kiyawat, Zhenxing Fan, Yasas Seneviratne, Morteza Baradaran, Akhil Shekar, Zihan Xia, Mingu Kang, Kevin Skadron

TL;DR

Sangam tackles the memory bottleneck in large language model inference by moving compute closer to memory with a chiplet-based DRAM‑PIM module connected via CXL. The architecture decouples memory density from logic, using a 7 nm logic center stripe and many 8×8 FP16 systolic arrays aligned with DDR5 banks to accelerate the flat GEMMs common in decode and prefill. A hierarchical data/compute mapping (KV cache vs weights, rank/chip/bank level) and the HARMONI modeling framework enable end‑to‑end evaluation, yielding up to ~4× speedup in end-to-end (E2E) latency, ~10× higher decode throughput, and order‑of‑magnitude energy savings over an H100 across the evaluated configurations and model sizes. Sangam demonstrates a scalable, cost‑effective path to higher capacity and bandwidth than GPU‑HBM setups, suitable for near‑term deployment and for larger future LLM workloads.

Abstract

Large Language Models (LLMs) are becoming increasingly data-intensive due to growing model sizes, and they are becoming memory-bound as the context length and, consequently, the key-value (KV) cache size increase. Inference, particularly the decoding phase, is dominated by memory-bound GEMV or flat GEMM operations with low operational intensity (OI), making it well-suited for processing-in-memory (PIM) approaches. However, existing in/near-memory solutions face critical limitations such as reduced memory capacity due to the high area cost of integrating processing elements (PEs) within DRAM chips, and limited PE capability due to the constraints of DRAM fabrication technology. This work presents a chiplet-based memory module that addresses these limitations by decoupling logic and memory into chiplets fabricated in heterogeneous technology nodes and connected via an interposer. The logic chiplets sustain high-bandwidth access to the DRAM chiplets, which house the memory banks, and enable the integration of advanced processing components such as systolic arrays and SRAM-based buffers to accelerate memory-bound GEMM kernels, capabilities that were not feasible in prior PIM architectures. We propose Sangam, a CXL-attached, PIM-chiplet-based memory module that can either act as a drop-in replacement for GPUs or co-execute alongside them. Sangam achieves 3.93×, 4.22×, and 2.82× speedups in end-to-end query latency, 10.3×, 9.5×, and 6.36× greater decoding throughput, and order-of-magnitude energy savings compared to an H100 GPU for varying input size, output length, and batch size on LLaMA 2-7B, Mistral-7B, and LLaMA 3-70B, respectively.
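To make the low-OI claim concrete, the following is a minimal sketch of how operational intensity scales with the M dimension of a GEMM. It is not taken from the paper: the function name `gemm_operational_intensity`, the ideal traffic model (each matrix is moved exactly once, with no cache reuse or re-reads), and the choice of K = N = 4096 (the hidden size of LLaMA 2-7B) are illustrative assumptions.

```python
def gemm_operational_intensity(m: int, k: int, n: int, bytes_per_elem: int = 2) -> float:
    """Return FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumption (not from the paper): A, B, and C each cross the memory
    interface exactly once, and all elements are FP16 (2 bytes).
    """
    flops = 2 * m * k * n  # one multiply and one add per inner-product term
    bytes_moved = bytes_per_elem * (m * k + k * n + m * n)
    return flops / bytes_moved


if __name__ == "__main__":
    k = n = 4096  # illustrative weight-matrix shape
    # M = 1 is a GEMV; small M is a "flat" GEMM; large M resembles prefill.
    for m in (1, 8, 64, 2048):
        oi = gemm_operational_intensity(m, k, n)
        print(f"M={m:5d}  OI={oi:8.2f} FLOPs/byte")
```

Under this model, a GEMV (M = 1) performs just under one FLOP per byte moved, because the weight matrix dominates traffic and each weight is used only once. OI grows roughly linearly with M while M is small, which is why small-batch decode stays on the bandwidth-limited side of a roofline.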

Paper Structure

This paper contains 29 sections, 17 figures, 3 tables.

Figures (17)

  • Figure 1: LLaMA 2-7B inference kernel latency breakdown on one H100 GPU. Left and right figures show breakdown under varying batch sizes and input lengths when output length is 32 and 2048, respectively. (BS = Batch Size.)
  • Figure 2: GPU utilization for GEMMs with varying M dimension.
  • Figure 3: OI characterization for kernels in different phases of E2E LLM inference on different rooflines. Here, the OI is calculated for 2048 inputs, 2048 outputs, and varying batch size (in the range of 1-64) for LLaMA 2-7B model.
  • Figure 4: Chiplet DRAM.
  • Figure 5: System integration and organization of the Sangam modules. SA: systolic array. MUL: SIMD multiplier. MCoOI: multichip on one interposer.
  • ...and 12 more figures