Fast State Restoration in LLM Serving with HCache

Shiwei Gao, Youmin Chen, Jiwu Shu

TL;DR

HCache is a novel LLM state restoration method that restores LLM states from intermediate activations, so that both computational and I/O resources are used with low overhead.

Abstract

The growing complexity of LLM usage today, e.g., multi-round conversation and retrieval-augmented generation (RAG), makes contextual states (i.e., the KV cache) reusable across user requests. Given the capacity constraints of GPU memory, only a limited number of contexts can be cached on the GPU for reuse. Existing inference systems typically evict part of the KV cache and restore it either by recomputing it from the original tokens or by offloading it to host storage for later retrieval; both approaches introduce substantial computational or I/O overheads. We propose HCache, a novel LLM state restoration method. Its key idea is to restore LLM states from intermediate activations and thus utilize computational and I/O resources with low overhead. We enhance HCache with two techniques: i) a bubble-free restoration scheduler that integrates resource-complementary methods to balance computation and I/O tasks; and ii) a chunk-based storage manager that addresses the layout mismatch issue (i.e., layer-before-token saving versus token-before-layer restoration). Our evaluations, conducted using real-world tasks, show that HCache reduces the TTFT by up to 1.93X compared to KV offload while consuming 1.92-2.40X less storage space; compared to token recomputation, HCache achieves up to 5.73X reduction in TTFT.
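
The key idea above can be illustrated with a minimal sketch. It assumes that the "intermediate activation" is the per-layer hidden state fed into the attention block, and that keys and values are plain linear projections of it with the same width as the hidden state (no normalization, positional encoding, biases, or grouped-query attention). The names `w_k`, `w_v`, `hidden`, and the toy sizes are hypothetical and do not come from the paper; this is not HCache's implementation.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def make_matrix(rows, cols, rng):
    """Build a rows x cols matrix of pseudo-random floats."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def count_elements(matrix):
    """Number of scalar entries in a matrix."""
    return sum(len(row) for row in matrix)

rng = random.Random(0)
n_tokens, d_model = 4, 8  # toy sizes, chosen only for illustration

# Hypothetical projection weights of one layer (assumed plain linear maps).
w_k = make_matrix(d_model, d_model, rng)
w_v = make_matrix(d_model, d_model, rng)

# Hidden states entering this layer's attention block during the original pass.
hidden = make_matrix(n_tokens, d_model, rng)

# Original pass: the KV cache lives on the GPU; only `hidden` is saved to host storage.
k_cache = matmul(hidden, w_k)
v_cache = matmul(hidden, w_v)
saved_hidden = [row[:] for row in hidden]

# Restoration: load the saved hidden states and redo only the two projections,
# instead of rerunning the whole model or loading both K and V.
k_restored = matmul(saved_hidden, w_k)
v_restored = matmul(saved_hidden, w_v)

# Under these assumptions, K plus V hold twice as many values as the hidden states.
storage_ratio = (count_elements(k_cache) + count_elements(v_cache)) / count_elements(saved_hidden)
```

In this toy setting the restored keys and values match the originals exactly, and the saved hidden states occupy half the space of the KV cache. The storage and TTFT figures reported in the abstract are measurements on real models and are not reproduced by this sketch.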

Paper Structure (33 sections, 11 equations, 15 figures, 3 tables)

Figures (15)

  • Figure 1: State Restoration Method Comparison. Recomputation: compute the KV cache from history tokens when reused; KV cache offload: save the KV cache in host storage and fetch it to GPU memory when reused; HCache saves 6$\times$ computational and 2$\times$ I/O resources.
  • Figure 2: Transformer Architecture.
  • Figure 3: Characteristics of Multi-Round Conversation.
  • Figure 4: Comparison of State Restoration Overhead. We use the L-Eval trace; OPT-30B runs on 4$\times$ A100-40G GPUs with tensor parallelism, while the other two models each run on a single A100; 4$\times$ PM9A3 SSDs are used as the storage backend to save offloaded KV caches.
  • Figure 5: Example of Pipelined Restoration with HCache.
  • ...and 10 more figures