
Towards More Economical Context-Augmented LLM Generation by Reusing Stored KV Cache

Hanchen Li, Yuhan Liu, Yihua Cheng, Kuntai Du, Junchen Jiang

TL;DR

This work investigates storing and reusing KV caches for context-augmented LLMs in public clouds to reduce latency and cost. It presents an analytical model contrasting text recomputation and KV-cache reuse, with explicit formulations for $C_{text}$ and $C_{KV}$ that account for GPU compute, storage, and transmission costs, and validates the findings via simulation and AWS pricing scenarios. The results indicate that reusing KV caches can achieve substantial delay and cost savings, especially for long-context tasks, while storage costs remain a small fraction of total expense, suggesting a practical path toward economical context augmentation. Overall, the study demonstrates a viable strategy for deploying more economical LLM services by leveraging cloud-stored KV caches.

Abstract

Across large language model (LLM) applications, we observe an emerging trend of reusing KV caches to avoid the prefill delay of processing input text that repeats across LLM inputs. This has opened a broad design space, ranging from colocating stored KV caches with (or close to) GPUs to various KV cache compression techniques. However, a key question remains unanswered: can these delay reductions also be economically favorable? Specifically, we ask whether a developer can use public cloud services to store precomputed KV caches and reuse them to save delay without incurring higher compute, storage, and network costs. To answer this question, we propose a validated analytical model for the cloud cost (compute, storage, and network) of storing and reusing KV caches as a function of workload parameters such as reuse frequency, generated text length, and model size. Preliminary results show that KV cache reuse saves both delay and cloud cost across a range of long-context workloads. We call for more effort on building more economical context-augmented LLMs through KV cache reuse.
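
To make the comparison concrete, the following is a minimal sketch of how the two costs could be tallied. It is not the paper's formulation: the function names `cost_text` and `cost_kv`, the fields of `Workload` and `Prices`, and the way the terms are combined are all assumptions made for illustration, and any prices passed in are placeholders rather than actual AWS rates. The sketch assumes text recomputation pays for prefill on every request, while KV cache reuse pays for prefill once and then pays for storage and transfer of the cache.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical workload parameters (not taken from the paper)."""
    reuse_count: int        # requests that share the same context
    prefill_seconds: float  # GPU time to prefill the context from text
    decode_seconds: float   # GPU time to generate the output, per request
    kv_size_gb: float       # size of the stored KV cache
    storage_hours: float    # how long the KV cache is kept in storage


@dataclass
class Prices:
    """Hypothetical unit prices; placeholders, not real cloud rates."""
    gpu_per_hour: float
    storage_per_gb_hour: float
    transfer_per_gb: float


def cost_text(w: Workload, p: Prices) -> float:
    """Text recomputation: every request pays for prefill and decode."""
    gpu_seconds = w.reuse_count * (w.prefill_seconds + w.decode_seconds)
    return gpu_seconds / 3600 * p.gpu_per_hour


def cost_kv(w: Workload, p: Prices) -> float:
    """KV cache reuse: one prefill, then storage, transfer, and decode.

    Assumption: GPU time spent waiting for the cache to load is ignored.
    """
    gpu_seconds = w.prefill_seconds + w.reuse_count * w.decode_seconds
    compute = gpu_seconds / 3600 * p.gpu_per_hour
    storage = w.kv_size_gb * w.storage_hours * p.storage_per_gb_hour
    transfer = w.reuse_count * w.kv_size_gb * p.transfer_per_gb
    return compute + storage + transfer
```

Under these assumptions, reuse becomes cheaper once the prefill compute saved across repeated requests exceeds the added storage and transfer cost, which is more likely as the context gets longer and is reused more often.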

Paper Structure

This paper contains 3 sections, 2 equations, 2 figures.

Figures (2)

  • Figure 1: Illustration of text recomputation and KV cache reusing.
  • Figure 2: The cost and end-to-end delay for Llama-7B by varying input lengths.