Cache Your Prompt When It's Green: Carbon-Aware Caching for Large Language Model Serving

Yuyang Tian, Desen Sun, Yi Ding, Sihang Liu

TL;DR

This paper tackles an overlooked source of carbon emissions in LLM serving: the embodied carbon of the storage used to cache KV contexts. It introduces GreenCache, a carbon-aware caching framework that profiles performance and power, predicts load and carbon intensity, and uses an ILP-based optimizer to resize the cache while enforcing SLO attainment. A novel Least Carbon Savings replacement policy guides eviction decisions, balancing operational savings against embodied costs. Across realistic traces and two LLMs, GreenCache reduces carbon emissions by about 15% on average and up to about 25% while maintaining latency targets, demonstrating the practical viability of dynamic, carbon-aware storage management in LLM deployment.

Abstract

As large language models (LLMs) become widely used, their environmental impact, especially their carbon emissions, has attracted growing attention. Prior studies focus on compute-related carbon emissions. In this paper, we find that storage is another key contributor. LLM caching, which saves and reuses KV caches for repeated context, reduces operational carbon by avoiding redundant computation. However, this benefit comes at the cost of embodied carbon from high-capacity, high-speed SSDs. As LLMs scale, the embodied carbon of storage grows significantly. To address this tradeoff, we present GreenCache, a carbon-aware cache management framework that dynamically derives resource allocation plans for LLM serving. GreenCache analyzes the correlation between carbon emissions and SLO satisfaction, and reconfigures resources over time to balance the two under dynamic workloads. Evaluations on real traces demonstrate that GreenCache achieves an average carbon reduction of 15.1% when serving Llama-3 70B in the FR grid, with reductions reaching up to 25.3%, while staying within latency constraints for more than 90% of requests.
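
The tradeoff described in the abstract can be illustrated with a small back-of-the-envelope sketch. It is not the paper's carbon model: the linear amortization of embodied carbon over a device lifetime, the function names, and every number below are hypothetical assumptions chosen only to show how the balance between operational savings and embodied cost can change sign with grid carbon intensity.

```python
def operational_carbon_g(energy_kwh, ci_g_per_kwh):
    """Operational carbon: energy drawn times grid carbon intensity."""
    return energy_kwh * ci_g_per_kwh


def amortized_embodied_carbon_g(capacity_tb, embodied_g_per_tb, lifetime_h, window_h):
    """Share of the SSD's manufacturing carbon attributed to one time window.

    Assumption: embodied carbon is spread linearly over the device lifetime.
    """
    return capacity_tb * embodied_g_per_tb * window_h / lifetime_h


def net_carbon_saving_g(energy_saved_kwh, ci_g_per_kwh, capacity_tb,
                        embodied_g_per_tb, lifetime_h, window_h=1.0):
    """Operational carbon avoided by cache hits minus the embodied cost of
    keeping the cache provisioned for the same window (grams CO2e)."""
    saved = operational_carbon_g(energy_saved_kwh, ci_g_per_kwh)
    cost = amortized_embodied_carbon_g(capacity_tb, embodied_g_per_tb,
                                       lifetime_h, window_h)
    return saved - cost


# Hypothetical inputs: same cache and same energy saved, two grid intensities.
for ci in (40.0, 400.0):  # gCO2e/kWh, illustrative values only
    net = net_carbon_saving_g(energy_saved_kwh=0.125, ci_g_per_kwh=ci,
                              capacity_tb=8.0, embodied_g_per_tb=30000.0,
                              lifetime_h=40000.0)
    print(f"CI={ci:g} g/kWh -> net saving {net:g} g")
```

Under these assumed numbers the same cache is a net cost at the lower carbon intensity and a net saving at the higher one, which is the kind of time-varying balance a carbon-aware cache sizing policy has to track.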

Paper Structure

This paper contains 41 sections, 1 theorem, 9 equations, 21 figures, 3 tables.

Key Result

Theorem 1

The GreenCache optimization problem in Eq. (eq:ilp) is NP-hard, even in a restricted setting where each time step only allows a binary cache decision (off/on), and the SLO requirement is a global ratio constraint: at least a configurable fraction $\rho$ of all requests over the horizon satisfy the T
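
To make the restricted setting in the theorem concrete, here is a minimal sketch of it as an exhaustive search. This is not the paper's ILP formulation or solver. The per-step carbon costs and the per-step counts of requests meeting the target are hypothetical inputs, and the per-request target itself (truncated in the statement above) is abstracted away as a simple count. The search enumerates all 2^T off/on plans, so it is only practical for tiny horizons.

```python
from itertools import product


def best_binary_plan(carbon_off, carbon_on, ok_off, ok_on, total_requests, rho):
    """Return (plan, carbon) minimizing total carbon over off/on cache plans.

    carbon_off[t] / carbon_on[t]: carbon at step t with the cache off / on.
    ok_off[t] / ok_on[t]: requests at step t that meet the target off / on.
    A plan is feasible if at least rho * total_requests requests meet the
    target over the whole horizon (a global ratio constraint).
    Returns (None, inf) when no plan is feasible.
    """
    horizon = len(carbon_off)
    best_plan, best_carbon = None, float("inf")
    for plan in product((0, 1), repeat=horizon):  # 2**horizon candidates
        carbon = sum(carbon_on[t] if on else carbon_off[t]
                     for t, on in enumerate(plan))
        ok = sum(ok_on[t] if on else ok_off[t] for t, on in enumerate(plan))
        if ok >= rho * total_requests and carbon < best_carbon:
            best_plan, best_carbon = plan, carbon
    return best_plan, best_carbon


# Hypothetical three-step horizon with 100 requests per step.
plan, carbon = best_binary_plan(
    carbon_off=[10, 10, 10], carbon_on=[8, 12, 9],
    ok_off=[50, 50, 50], ok_on=[100, 100, 100],
    total_requests=300, rho=0.6,
)
print(plan, carbon)
```

Because the constraint couples all time steps through a single global ratio, the decision at one step cannot be made independently of the others, which is why the sketch searches over whole plans rather than choosing greedily per step.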

Figures (21)

  • Figure 1: Illustration of caching for LLM serving.
  • Figure 2: (a) Average carbon intensity (CI) and energy sources of four grids in 2024 [ele_maps]. (b) CI variation due to energy sources of the CISO grid on July 6, 2022 [maji2022carboncast].
  • Figure 3: (a) Latency and speedup from caching under different context lengths. (b) Latency breakdown.
  • Figure 3: Hit rate (Llama-3 70B).
  • Figure 4: Distribution of context length in (a) ShareGPT [sharegpt_dataset] and (b) TriviaQA [joshi2017triviaqa].
  • ...and 16 more figures

Theorems & Definitions (1)

  • Theorem 1: NP-hardness