Efficient LLM Inference with Activation Checkpointing and Hybrid Caching

Sanghyeon Lee, Hongbeen Kim, Soojin Hwang, Guseul Heo, Minwoo Noh, Jaehyuk Huh

TL;DR

This work tackles the memory and bandwidth challenges of large-language-model inference with host-memory offloading. It introduces HybridServe, which combines activation checkpointing (activation cache) with a KV-Activation hybrid caching strategy to balance recomputation and data transfer, maximizing overlap between PCIe transfers and GPU compute. Empirical results show up to 2.19x throughput gains over FlexGen and substantial improvements in GPU utilization and host-GPU traffic, especially for larger models. The approach enables cost-effective, high-throughput LLM inference under relaxed latency constraints and provides a practical path toward scalable single-GPU or memory-constrained deployments.

Abstract

Recent large language models (LLMs) with enormous model sizes use many GPUs to meet memory capacity requirements, incurring substantial costs for token generation. To provide cost-effective LLM inference with relaxed latency constraints, extensive research has focused on expanding GPU memory by leveraging host memory. However, LLM inference engines that use host memory often underutilize GPU compute units, because a considerable portion of inference time is spent loading the model onto the GPU over the host-GPU interconnect. To address these challenges of host-memory offloading for LLMs, we introduce HybridServe, an LLM inference system with activation checkpointing based on activation caching. The activation cache stores activation checkpoints generated during intermediate inference stages, allowing fast recomputation of the KV cache while model parameters are transferred from host memory to the GPU. Unlike conventional methods that recompute the KV cache from scratch using token IDs, the activation cache allows bypassing projection and FFN operations. To balance activation recomputation against parameter loading overhead, this study proposes a KV-activation hybrid caching scheme that finds the best ratio of key-value and activation caches to adjust the recomputation time. Our system achieves a 2.19x throughput improvement over the state-of-the-art prior work for offloading both model weights and KV cache.
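The trade-off behind the hybrid cache can be illustrated with a rough back-of-the-envelope model. The sketch below is not the paper's algorithm. It assumes standard multi-head attention (keys and values each have the hidden dimension), a recomputation cost of two hidden-by-hidden projections per token, and it ignores the time to transfer the caches themselves. All function names and the hardware numbers in the usage lines are hypothetical.

```python
def kv_bytes_per_token(hidden, dtype_bytes=2):
    # One key vector and one value vector per token per layer (assumed MHA).
    return 2 * hidden * dtype_bytes


def act_bytes_per_token(hidden, dtype_bytes=2):
    # One layer-input activation vector per token per layer.
    return hidden * dtype_bytes


def recompute_flops_per_token(hidden):
    # K and V projections: two (hidden x hidden) matmuls, 2 FLOPs per multiply-add.
    return 2 * 2 * hidden * hidden


def choose_activation_tokens(n_tokens, hidden, weight_bytes_per_layer,
                             pcie_bytes_per_s, gpu_flops_per_s):
    """Largest number of tokens to keep as activations so that recomputing
    their K/V fits inside one layer's weight-transfer time (simplified model)."""
    load_time = weight_bytes_per_layer / pcie_bytes_per_s
    per_token_time = recompute_flops_per_token(hidden) / gpu_flops_per_s
    return min(n_tokens, int(load_time / per_token_time))


# Hypothetical numbers, for illustration only.
hidden = 4096
n_act = choose_activation_tokens(
    n_tokens=100_000,
    hidden=hidden,
    weight_bytes_per_layer=400e6,   # assumed per-layer weight size
    pcie_bytes_per_s=25e9,          # assumed effective PCIe bandwidth
    gpu_flops_per_s=100e12,         # assumed sustained GPU throughput
)
n_kv = 100_000 - n_act
print(n_act, n_kv)
print(act_bytes_per_token(hidden), kv_bytes_per_token(hidden))
```

Under these assumptions an activation entry takes half the space of the corresponding key-value pair, so every token kept as an activation halves its host-GPU cache traffic at the price of extra GPU work. The ratio is bounded by how much recomputation fits in the time the GPU would otherwise spend waiting for weights.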

Paper Structure

This paper contains 27 sections, 10 equations, 15 figures, 3 tables, 1 algorithm.

Figures (15)

  • Figure 1: (a) Existing system with the KV cache, (b) System with KV-Activation hybrid cache.
  • Figure 2: KV cache management with PagedAttention.
  • Figure 3: Performance evaluation of FlexGen with OPT-30B: (a) token generation throughput with varying input prompt lengths, and (b) memory footprint of the KV cache with 1024 input tokens.
  • Figure 4: Token generation latency normalized to the latency without recomputation, with varying recomputation ratios for OPT-30B (left) and OPT-66B (right). The red line indicates the basis for normalization.
  • Figure 5: Computation difference for retrieving context of previous tokens for multi-head attention, in (a) Token recomputation and (b) Activation recomputation.
  • ...and 10 more figures