PIMphony: Overcoming Bandwidth and Capacity Inefficiency in PIM-based Long-Context LLM Inference System

Hyucksung Kwon, Kyungmo Koo, Janghyeon Kim, Woongkyu Lee, Minjae Lee, Gyeonggeun Jung, Hyungdeok Lee, Yousub Jung, Jaehan Park, Yosub Song, Byeongsu Yang, Haerang Choi, Guhyun Kim, Jongsoon Won, Woojae Shin, Changhyun Kim, Gyeongcheol Shin, Yongkee Kwon, Ilkon Kim, Euicheol Lim, John Kim, Jungwook Choi

TL;DR

PIMphony tackles the memory bandwidth and capacity bottlenecks that arise when running long-context LLMs on PIM hardware. It introduces three co-designed techniques (Token-Centric PIM Partitioning, Dynamic PIM Command Scheduling, and Dynamic PIM Access), implemented via an MLIR-based compiler, to maximize MAC throughput and to manage the KV cache dynamically. The approach yields speedups of up to 11.3× on PIM-only systems and up to 8.4× on xPU+PIM configurations for models of up to 72B parameters with context lengths of up to 1M tokens, while substantially improving memory efficiency. Together, these results indicate that PIM-based long-context inference becomes markedly more practical and energy-efficient for real-world workloads.

Abstract

The expansion of long-context Large Language Models (LLMs) creates significant memory system challenges. While Processing-in-Memory (PIM) is a promising accelerator, we identify that it suffers from critical inefficiencies when scaled to long contexts: severe channel underutilization, performance-limiting I/O bottlenecks, and massive memory waste from static KV cache management. In this work, we propose PIMphony, a PIM orchestrator that systematically resolves these issues with three co-designed techniques. First, Token-Centric PIM Partitioning (TCP) ensures high channel utilization regardless of batch size. Second, Dynamic PIM Command Scheduling (DCS) mitigates the I/O bottleneck by overlapping data movement and computation. Finally, a Dynamic PIM Access (DPA) controller enables dynamic memory management to eliminate static memory waste. Implemented via an MLIR-based compiler and evaluated on a cycle-accurate simulator, PIMphony significantly improves throughput for long-context LLM inference (up to 72B parameters and 1M context length). Our evaluations show performance boosts of up to 11.3x on PIM-only systems and 8.4x on xPU+PIM systems, enabling more efficient deployment of LLMs in real-world long-context applications.
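
To make the "memory waste from static KV cache management" concrete, here is a minimal sketch of the arithmetic. It is not taken from the paper: the model dimensions (32 layers, 8 KV heads, head dimension 128, 2-byte elements) are hypothetical values for a GQA model of roughly 7B parameters, and "static" is assumed to mean that every request reserves KV-cache space for the maximum context length regardless of how many tokens it actually holds.

```python
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to store keys and values for `tokens` tokens.

    All default dimensions are hypothetical, chosen only for illustration.
    The leading factor 2 accounts for one key and one value vector per token.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens


def static_waste_fraction(actual_lengths, max_context):
    """Fraction of reserved KV-cache memory left unused when every request
    reserves space for `max_context` tokens (assumed static policy)."""
    reserved = len(actual_lengths) * kv_cache_bytes(max_context)
    used = sum(kv_cache_bytes(n) for n in actual_lengths)
    return 1.0 - used / reserved


if __name__ == "__main__":
    # Per-token and per-request footprint under the assumed dimensions.
    print(kv_cache_bytes(1) // 1024, "KiB per token")
    print(kv_cache_bytes(32 * 1024) / 2**30, "GiB per 32K-token request")

    # Three requests of different lengths, each reserving a 32K-token slot.
    lengths = [4 * 1024, 8 * 1024, 32 * 1024]
    print(f"{static_waste_fraction(lengths, 32 * 1024):.1%} of reserved memory unused")
```

Under these assumptions the footprint grows linearly with both context length and batch size, and the unused share depends only on how far actual request lengths fall below the reserved maximum. A dynamic scheme that allocates as tokens arrive would avoid that reservation gap.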

Paper Structure (29 sections, 20 figures, 4 tables)

Figures (20)

  • Figure 1: Decoding computation for long-context LLM ($g$: group size of GQA [ainslie2023gqa])
  • Figure 2: Characteristics of long-context LLM decoding on LLM-7B (w/ GQA). (a) Compute intensity (FLOPs/Byte) decreases with context length. (b) GPU memory footprint grows with both context length and batch size; the dashed line marks the A100-80GB capacity.
  • Figure 3: Overview of PIM module/node configuration. (a): PIM module architecture. (b) and (c): PIM node configuration - heterogeneous xPU+PIM and PIM-only.
  • Figure 4: PIM utilization under (a) short (4K) and (b) long (32K) contexts using CENT [cent] and PIMphony on LLM-7B-32K-GQA. Batch size scales inversely with context length due to the capacity constraint.
  • Figure 5: High-level overview of PIMphony with the three main components highlighted.
  • ...and 15 more figures