LLMCache: Layer-Wise Caching Strategies for Accelerated Reuse in Transformer Inference

Harsh Vardhan Bansal

TL;DR

LLMCache addresses transformer inference latency by introducing a model-agnostic, layer-wise activation caching framework that reuses intermediate representations across semantically similar inputs via per-layer caches and semantic fingerprints. It extends beyond token-level KV caching to support both encoder and decoder architectures without architectural changes, using lightweight fingerprints and adaptive eviction to maintain freshness. Empirical results across BERT and GPT-2 on WikiText-103, SQuAD, and OpenBookQA show up to 3.1x speedups with negligible accuracy loss, with high hit rates in lower/mid layers. The approach demonstrates practical system-level gains for real-time and large-scale transformer deployment and outlines design considerations and avenues for future improvements like dynamic thresholds and distributed caches.

Abstract

Transformer-based language models have achieved remarkable performance across a wide range of tasks, yet their high inference latency poses a significant challenge for real-time and large-scale deployment. While existing caching mechanisms, such as token-level key-value caches, offer speedups in autoregressive decoding, they are limited in scope and applicability. In this paper, we present LLMCache, a novel layer-wise caching framework that accelerates transformer inference by reusing intermediate activations based on semantic similarity of input sequences. Unlike prior work, LLMCache is model-agnostic, operates across both encoder and decoder architectures, and supports caching at arbitrary transformer layers. We introduce a lightweight fingerprinting mechanism for matching semantically similar inputs and propose adaptive eviction strategies to manage cache staleness. Experiments on BERT and GPT-2 across SQuAD, WikiText-103, and OpenBookQA show up to 3.1x speedup in inference time with <0.5% accuracy degradation. Our results highlight LLMCache as a practical and general-purpose solution for optimizing transformer inference in real-world applications.

Paper Structure

This paper contains 32 sections, 1 equation, 5 figures, 2 tables, 1 algorithm.

Figures (5)

  • Figure 1: High-level LLMCache system architecture. Each transformer layer is equipped with its own cache bank and lookup logic.
  • Figure 2: Inference flow in LLMCache showing fingerprint generation, cache lookup, reuse decision, and fallback computation.
  • Figure 3: Cache Hit Rate vs. Transformer Layer Index (GPT-2, WikiText)
  • Figure 4: Memory Overhead vs. Cache Hit Rate (BERT-base)
  • Figure 5: Cache Threshold Sensitivity: Varying $\tau$
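
The inference flow named in the Figure 2 caption (fingerprint generation, cache lookup, reuse decision, fallback computation) can be illustrated with a minimal Python sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the hashed bag-of-tokens `fingerprint`, the cosine comparison against a threshold `tau` (standing in for the $\tau$ of Figure 5), the `LayerCache` class, and the plain LRU eviction used in place of the paper's adaptive eviction strategies.

```python
import math
from collections import OrderedDict


def fingerprint(token_ids, dim=16):
    """Hypothetical fingerprint: token counts hashed into `dim` buckets, L2-normalised."""
    vec = [0.0] * dim
    for t in token_ids:
        vec[t % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return tuple(v / norm for v in vec)


def cosine(a, b):
    """Cosine similarity of two vectors that are already unit-normalised."""
    return sum(x * y for x, y in zip(a, b))


class LayerCache:
    """One cache bank per layer (illustrative; LRU stands in for adaptive eviction)."""

    def __init__(self, capacity=128, tau=0.9):
        self.capacity = capacity
        self.tau = tau                  # assumed similarity threshold for reuse
        self.entries = OrderedDict()    # fingerprint -> cached layer output

    def lookup(self, fp):
        """Return the cached output of the most similar entry, or None on a miss."""
        best_key, best_sim = None, -1.0
        for key in self.entries:
            sim = cosine(fp, key)
            if sim > best_sim:
                best_key, best_sim = key, sim
        if best_key is not None and best_sim >= self.tau:
            self.entries.move_to_end(best_key)   # mark as recently used
            return self.entries[best_key]
        return None

    def store(self, fp, activation):
        """Insert an entry and evict the least recently used ones beyond capacity."""
        self.entries[fp] = activation
        self.entries.move_to_end(fp)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)


def run_with_cache(token_ids, layers, caches, embed):
    """Run the layers in order, reusing a cached output wherever the lookup hits."""
    fp = fingerprint(token_ids)
    hidden = embed(token_ids)
    hits = 0
    for layer, cache in zip(layers, caches):
        cached = cache.lookup(fp)
        if cached is not None:
            hidden = cached             # reuse decision: skip this layer's computation
            hits += 1
        else:
            hidden = layer(hidden)      # fallback computation on a miss
            cache.store(fp, hidden)
    return hidden, hits
```

In this sketch `layers` is any list of callables and `embed` maps token IDs to an initial hidden state, so toy functions are enough to exercise the hit and miss paths. The linear scan in `lookup` keeps the example short; a real system would more plausibly use an approximate nearest-neighbour index, and raising or lowering `tau` trades hit rate against the risk of reusing activations from an input that is not similar enough.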