Reimagining Memory Access for LLM Inference: Compression-Aware Memory Controller Design

Rui Xie, Asad Ul Haq, Linsen Ma, Yunhua Fang, Zirak Burzin Engineer, Liu Liu, Tong Zhang

TL;DR

This work tackles memory bandwidth and capacity bottlenecks in LLM inference by embedding lossless compression and context-aware dynamic quantization inside the on-chip memory controller. It introduces bit-plane disaggregation and cross-token KV cache clustering with exponent delta transformations to dramatically improve lossless compressibility of weights and KV data, enabling standard compressors like LZ4 and ZSTD to achieve substantial footprint reductions without accuracy loss. The approach is validated across public LLMs, showing up to 25.2% weight and 46.9% KV-cache footprint reductions, plus DRAM access energy reductions up to 29.9% and model-load latency improvements up to 30%, with a hardware prototype delivering up to 2 TB/s throughput on a 7 nm process. The results demonstrate the viability of LLM-aware memory control as a practical path to scalable, energy-efficient large-scale inference with modest hardware overhead. This work offers a concrete architectural mechanism to align memory bandwidth, capacity, and energy with the dynamic precision needs of modern LLMs.

Abstract

The efficiency of Large Language Model (LLM) inference is often constrained by substantial memory bandwidth and capacity demands. Existing techniques, such as pruning, quantization, and mixture of experts/depth, reduce memory capacity and/or bandwidth consumption at the cost of slight degradation in inference quality. This paper introduces a design solution that further alleviates memory bottlenecks by enhancing the on-chip memory controller in AI accelerators to achieve two main objectives: (1) significantly reducing memory capacity and bandwidth usage through lossless block compression (e.g., LZ4 and ZSTD) of model weights and key-value (KV) cache without compromising inference quality, and (2) enabling memory bandwidth and energy consumption to scale proportionally with context-dependent dynamic quantization. These goals are accomplished by equipping the on-chip memory controller with mechanisms to improve fine-grained bit-level accessibility and compressibility of weights and KV cache through LLM-aware configuration of in-memory placement and representation. Experimental results on publicly available LLMs demonstrate the effectiveness of this approach, showing memory footprint reductions of 25.2% for model weights and 46.9% for KV cache. In addition, our hardware prototype at 4 GHz and 32 lanes (7 nm) achieves 8 TB/s throughput with a modest area overhead (under 3.8 mm²), which underscores the viability of LLM-aware memory control as a key to efficient large-scale inference.
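To make the idea of bit-level in-memory placement more concrete, the sketch below shows one plausible reading of bit-plane disaggregation: the bits at the same position across a block of 16-bit values are stored together, so that nearly constant planes (such as high exponent bits) form long uniform runs for a block compressor. This is an illustration under stated assumptions, not the paper's implementation. The helper names (`to_bf16_bits`, `disaggregate`, `reaggregate`, `compare_sizes`), the truncation-based bfloat16 encoding, the Gaussian stand-in for weights, and the use of `zlib` in place of LZ4 or ZSTD are all hypothetical choices made to keep the example dependency-free.

```python
import random
import struct
import zlib


def to_bf16_bits(x: float) -> int:
    """Keep the upper 16 bits of a float32 (a simple truncating bfloat16 encoding)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16


def disaggregate(values: list[int], width: int = 16) -> list[bytes]:
    """Split `width`-bit words into bit planes; plane 0 holds every word's MSB."""
    assert len(values) % 8 == 0, "pad the block to a multiple of 8 words"
    planes = []
    for k in range(width):
        shift = width - 1 - k
        plane = bytearray()
        for i in range(0, len(values), 8):
            byte = 0
            for v in values[i:i + 8]:
                byte = (byte << 1) | ((v >> shift) & 1)
            plane.append(byte)
        planes.append(bytes(plane))
    return planes


def reaggregate(planes: list[bytes], width: int = 16) -> list[int]:
    """Inverse of `disaggregate`: rebuild the original words bit by bit."""
    values = [0] * (len(planes[0]) * 8)
    for k, plane in enumerate(planes):
        shift = width - 1 - k
        for j, byte in enumerate(plane):
            for b in range(8):
                values[j * 8 + b] |= ((byte >> (7 - b)) & 1) << shift
    return values


def compare_sizes(n: int = 4096, seed: int = 0) -> tuple[int, int]:
    """Return (compressed size of the usual layout, compressed size of the planar layout)."""
    rng = random.Random(seed)
    # Assumption: small zero-centered Gaussian values as a stand-in for weights.
    words = [to_bf16_bits(rng.gauss(0.0, 0.02)) for _ in range(n)]
    interleaved = struct.pack(f"<{n}H", *words)
    planar = b"".join(disaggregate(words))
    assert reaggregate(disaggregate(words)) == words  # the transform is lossless
    # zlib is only a stand-in for the LZ4/ZSTD compressors named in the abstract.
    return len(zlib.compress(interleaved, 9)), len(zlib.compress(planar, 9))


if __name__ == "__main__":
    raw_size, planar_size = compare_sizes()
    print(f"interleaved: {raw_size} bytes, bit-plane: {planar_size} bytes")
```

Because each plane is stored contiguously, a reader that needs only the top few bits of every value (for example, under a lower-precision dynamic quantization setting) could fetch just those planes, which is one way memory traffic might scale with the precision actually used. How much the planar layout helps the compressor depends on the data and the compressor, so the sizes printed here should be treated as indicative only.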

Paper Structure

This paper contains 19 sections, 7 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Percentage contribution of KV cache and model weights to total memory footprint with increasing sequence length in the LLaMA 3.1 8B model.
  • Figure 2: Illustration of dynamic weight quantization in a transformer block based on Mixture-of-Depth-Expert (MoDE) [raposo2024mixture].
  • Figure 3: Accuracy comparison among quantization configurations for prune-only (a) and dynamic quantization (b, c), based on LLaMA-MoE-3.5B [zhu2024llama], on the PIQA [bisk2020piqa], WinoGrande [sakaguchi2021winogrande], LAMBADA [paperno2016lambada], and MMLU [hendrycks2020measuring] datasets.
  • Figure 4: Mitigating memory bottlenecks by enhancing the on-chip memory controller within AI accelerators.
  • Figure 5: Illustration of bit-plane disaggregation in-memory placement.
  • ...and 6 more figures