Jenga: Effective Memory Management for Serving LLM with Heterogeneity

Chen Zhang, Kuntai Du, Shu Liu, Woosuk Kwon, Xiangxi Mo, Yufeng Wang, Xiaoxuan Liu, Kaichao You, Zhuohan Li, Mingsheng Long, Jidong Zhai, Joseph Gonzalez, Ion Stoica

TL;DR

Jenga tackles the memory-efficiency bottleneck in batched LLM serving caused by heterogeneous embeddings and diverse token-dependency patterns. It introduces a two-level memory allocator that uses an $LCM$-based large-page layout and per-type small-page allocators, together with a prefix-subset evictor and customizable layer-specific caching policies. The approach reduces memory fragmentation and enables flexible caching strategies, achieving up to $79.6\%$ improvement in memory utilization and up to $4.92\times$ throughput improvements on diverse models and GPUs, while preserving latency. The work demonstrates practical impact by integrating with vLLM and enabling efficient serving of vision-language models, speculative decoding, and multi-model workloads without kernel changes.

Abstract

Large language models (LLMs) are widely used but expensive to run, especially as inference workloads grow. To lower costs, maximizing the request batch size by managing GPU memory efficiently is crucial. While PagedAttention has recently been proposed to improve the efficiency of memory management, we find that the growing heterogeneity in the embedding dimensions, attention, and access patterns of modern LLM architectures introduces new challenges for memory allocation. In this paper, we present Jenga, a novel memory allocation framework for heterogeneous embeddings in LLMs. Jenga tackles two key challenges: (1) minimizing memory fragmentation when managing embeddings of different sizes, and (2) enabling flexible caching and eviction policies tailored to the specific token-dependency patterns of various layers. Jenga employs a two-level memory allocator, leveraging the least common multiple (LCM) of embedding sizes to optimize memory usage and providing APIs to express layer-specific caching logic to enhance memory reuse. We implement Jenga on vLLM, a state-of-the-art LLM inference engine, and evaluate it with diverse LLMs, datasets, and GPU configurations. Evaluations show that Jenga improves GPU memory utilization by up to 79.6%, and increases serving throughput by up to 4.92x (1.80x on average).
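
The two-level idea in the abstract can be illustrated with a toy model. The sketch below is not Jenga's implementation or any vLLM API: the class name `TwoLevelAllocator`, the layer-type labels, and the byte sizes are all hypothetical, and eviction is left out entirely. It only shows why sizing large pages as the LCM of the per-type page sizes lets any large page be carved into whole small pages of any single type with no leftover bytes.

```python
from math import lcm


class TwoLevelAllocator:
    """Toy two-level allocator (hypothetical, for illustration only).

    Level 1 hands out large pages whose size is the LCM of all small-page
    sizes. Level 2 carves one large page into small pages of one layer type.
    """

    def __init__(self, total_bytes, page_sizes):
        # page_sizes maps a layer type to its small-page size in bytes.
        self.page_sizes = dict(page_sizes)
        # Every small-page size divides the LCM, so carving leaves no remainder.
        self.large_page = lcm(*self.page_sizes.values())
        self.free_large = list(range(total_bytes // self.large_page))
        # Per-type free lists of (large_page_id, slot_index) pairs.
        self.free_small = {t: [] for t in self.page_sizes}

    def alloc(self, layer_type):
        """Return the byte offset of a free small page for layer_type."""
        size = self.page_sizes[layer_type]
        pool = self.free_small[layer_type]
        if not pool:
            if not self.free_large:
                raise MemoryError("out of large pages")
            large_id = self.free_large.pop()
            # Dedicate the whole large page to this layer type.
            pool.extend((large_id, s) for s in range(self.large_page // size))
        large_id, slot = pool.pop()
        return large_id * self.large_page + slot * size


# Assumed sizes: 256-byte pages for one layer type, 384-byte pages for another.
allocator = TwoLevelAllocator(3072, {"self_attn": 256, "cross_attn": 384})
print(allocator.large_page)  # 768 = lcm(256, 384)
```

With these assumed sizes, a 768-byte large page holds either three 256-byte pages or two 384-byte pages, so a page freed by one layer type can be reused by the other without leaving a gap.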

Paper Structure

This paper contains 26 sections, 19 figures, 1 table.

Figures (19)

  • Figure 1: Traditional LLMs (left) vs. latest LLMs (right). LLMs are becoming increasingly heterogeneous and produce KV caches with different sizes and dependencies, which demands a new GPU memory manager design.
  • Figure 2: Contrasting traditional LLMs (top left) and latest LLMs. LLMs are becoming increasingly heterogeneous: the KV cache sizes may differ, the KV cache dependencies are different, and the LLM architecture can also diverge.
  • Figure 3: Visualizing the memory waste of Llama 3.2 vision model with 2 cross-attention layers (image tokens) and 3 self-attention layers (text tokens).
  • Figure 4: Balanced and aligned cache eviction policy can improve hit rate.
  • Figure 5: Overview of Jenga: a two-level memory management system for different types of layers. Jenga is composed of the LCM allocator for first-level page allocation and the prefix subset evictor for page deallocation. Within the page, a customized allocator and evictor manage the memory for the specific layer type.
  • ...and 14 more figures