PAM: Processing Across Memory Hierarchy for Efficient KV-centric LLM Serving System

Lian Liu, Shixin Zhao, Yutian Zhou, Yintao He, Mengdi Wang, Yinhe Han, Ying Wang

TL;DR

PAM tackles the dual memory bottlenecks (bandwidth and capacity) in KV-centric LLM serving by coordinating heterogeneous PIM devices across a tiered memory hierarchy. It introduces PAMattention, a token-wise parallel algorithm that enables in situ softmax and cross-tier reductions, and a KV-centric management stack with intra-device mapping and inter-device scheduling to balance workloads dynamically. The system unifies HBM-PIM, DDR-PIM, and SSD-PIM, supported by a compiler, memory management, and a processing scheduler, and demonstrates substantial throughput and energy-efficiency gains over state-of-the-art baselines across online and offline tasks with long contexts. These results indicate PAM’s potential to enable scalable, cost-effective serving of large-scale LLMs in real-world deployments, particularly for long-context and high-throughput scenarios.

Abstract

The widespread adoption of Large Language Models (LLMs) has exponentially increased the demand for efficient serving systems. With growing requests and context lengths, key-value (KV)-related operations, including attention computation and KV cache storage, have emerged as critical bottlenecks. They require massive memory bandwidth and capacity. Unfortunately, existing LLM serving systems, optimized for compute-bound workloads, fail to handle these memory-intensive operations effectively. Even with Processing-In-Memory (PIM) technology, current single-level memory designs cannot simultaneously satisfy the bandwidth and capacity requirements. To address these challenges, we propose Processing Across Memory (PAM), a KV-centric LLM serving system that coordinates heterogeneous PIM-enabled memory devices within a hierarchical architecture. PAM introduces a novel computing paradigm to balance high memory bandwidth with scalable capacity. First, PAM exploits the inherent context locality in KV access patterns to intelligently distribute KV tokens across the memory hierarchy. Second, to further exploit context locality, it introduces the PAMattention algorithm, enabling fine-grained parallel attention computation across heterogeneous PIM devices. Finally, PAM incorporates an intra-device KV mapping, inter-device KV migration interface, and an inter-device online KV scheduling algorithm to dynamically balance computational workloads. By addressing both bandwidth and capacity demands simultaneously, PAM significantly enhances the efficiency and scalability of LLM serving systems, paving the way for cost-effective, high-performance solutions in the era of large-scale AI.
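
The abstract does not spell out how PAMattention combines results from different devices. As an illustration only, the sketch below uses the standard online-softmax (log-sum-exp) merge, which is one well-known way to compute attention over disjoint token partitions and reduce the partial results exactly. It is not claimed to be the authors' algorithm. The function names, the three-way split, and the tier labels in the comments are assumptions made for this example.

```python
import math

def partial_attention(q, keys, values):
    """Attention over one KV partition.

    Returns (max_score, exp_sum, unnormalized_output) so that partitions
    can be merged later without revisiting their tokens.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)                              # local max for numerical stability
    weights = [math.exp(s - m) for s in scores]  # local softmax numerators
    exp_sum = sum(weights)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return m, exp_sum, out

def merge_partials(partials):
    """Reduce per-partition results into the exact full-softmax output."""
    m_global = max(m for m, _, _ in partials)
    dim = len(partials[0][2])
    total = 0.0
    acc = [0.0] * dim
    for m, exp_sum, out in partials:
        c = math.exp(m - m_global)               # rescale to the global max
        total += c * exp_sum
        for d in range(dim):
            acc[d] += c * out[d]
    return [a / total for a in acc]

def reference_attention(q, keys, values):
    """Single-partition attention, used only as a correctness check."""
    _, exp_sum, out = partial_attention(q, keys, values)
    return [o / exp_sum for o in out]

# Deterministic toy data: 10 tokens, key dim 4, value dim 3.
q = [0.5, -1.0, 0.25, 2.0]
keys = [[math.sin(i + j) for j in range(4)] for i in range(10)]
values = [[math.cos(i * j + 1) for j in range(3)] for i in range(10)]

# Arbitrary split into three partitions, standing in for three
# hypothetical tiers (e.g. SSD-PIM, DDR-PIM, HBM-PIM).
splits = [slice(0, 3), slice(3, 7), slice(7, 10)]
partials = [partial_attention(q, keys[s], values[s]) for s in splits]
merged = merge_partials(partials)
expected = reference_attention(q, keys, values)
```

Each partition only needs to return its local maximum, its exponential sum, and an unnormalized output vector, so the data exchanged in the reduction is independent of how many tokens the partition holds. That property is what makes this kind of merge attractive when partitions sit on devices with very different bandwidths.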

Paper Structure

This paper contains 47 sections, 5 equations, 13 figures, 2 tables, 2 algorithms.

Figures (13)

  • Figure 1: Illustration of (a) vLLM with offloading; (b) Layered PIM; and (c) PAM. System illustration is on the left side of each subfigure and the performance breakdown is on the right side of each subfigure. The x-axis of the histogram is batch size. The red solid line indicates the maximum capacity of HBM and the red dotted line indicates the maximum DDR capacity.
  • Figure 2: (a) LLM architecture. (b) The requirements of LLM serving. (c) Roofline model of attention computation.
  • Figure 3: Exploiting context locality to achieve hierarchical memory processing.
  • Figure 4: Overview of the proposed KV-centric LLM serving system, PAM.
  • Figure 5: Illustration of PAMattention processing.
  • ...and 8 more figures