Cacheback: Speculative Decoding With Nothing But Cache

Zhiyao Ma, In Gim, Lin Zhong

TL;DR

Cacheback reduces LLM inference latency by using an LRU-based n-gram cache to produce draft tokens for speculative decoding. A tree-attention mechanism verifies the drafts in a single forward pass, and a dual-table initialization mitigates the cold-start problem. Experiments on SpecBench show competitive speedups against other training-free strategies, with strong performance in the translation domain. The approach is lightweight, easy to integrate, and shows potential for rapid domain adaptation.

Abstract

We present Cacheback Decoding, a training-free and model-agnostic speculative decoding method that exploits the locality in language to accelerate Large Language Model (LLM) inference. Cacheback leverages only Least Recently Used (LRU) cache tables of token n-grams to generate draft sequences. Cacheback achieves state-of-the-art performance among comparable methods despite its minimalist design, and its simplicity allows easy integration into existing systems. Cacheback also shows potential for fast adaptation to new domains.

Paper Structure

This paper contains 10 sections, 4 figures, 1 table.

Figures (4)

  • Figure 1: Cacheback's cache table structure. Each leader is associated with a list of followers. Entries are evicted using the least recently used (LRU) policy.
  • Figure 2: Overview of decoding steps. Cacheback generates a draft tree by recursively querying the cache table and verifies it in one forward pass of the LLM using tree attention. In this example, every token of the last draft branch except its final token is accepted. Cacheback then updates the cache table with the accepted tokens, using a sliding window.
  • Figure 3: Wall-clock speedup ratio on SpecBench with Vicuna models. The radar plot shows the speedup on different task categories when running Vicuna 7B. Cacheback achieves superior or comparable performance to other training-free model-agnostic methods.
  • Figure 4: Speedup ratio of Cacheback on SpecBench running Vicuna 7B with different leader length (LL) and follower length (FL) settings.
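
The cache table and draft generation described in the captions for Figures 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class `NGramCache`, the function `build_draft_paths`, and the parameters `leader_len`, `follower_len`, `max_leaders`, `max_followers`, and `depth` are all hypothetical names, and the exact eviction and update rules in the paper may differ. Verification of the draft tree with tree attention requires the LLM itself and is not shown.

```python
from collections import OrderedDict


class NGramCache:
    """Toy leader -> followers table with LRU eviction (hypothetical sketch)."""

    def __init__(self, leader_len=2, follower_len=2, max_leaders=1024, max_followers=4):
        self.leader_len = leader_len
        self.follower_len = follower_len
        self.max_leaders = max_leaders
        self.max_followers = max_followers
        # Maps a leader n-gram to an ordered set of follower n-grams.
        # In both levels, the oldest entry comes first and the newest last.
        self.table = OrderedDict()

    def update(self, tokens):
        """Slide a window over tokens and record each (leader, follower) pair."""
        n = self.leader_len + self.follower_len
        for i in range(len(tokens) - n + 1):
            leader = tuple(tokens[i:i + self.leader_len])
            follower = tuple(tokens[i + self.leader_len:i + n])
            followers = self.table.pop(leader, None)
            if followers is None:
                followers = OrderedDict()
            followers.pop(follower, None)
            followers[follower] = None  # re-insert as most recently used
            if len(followers) > self.max_followers:
                followers.popitem(last=False)  # evict the oldest follower
            self.table[leader] = followers  # leader becomes most recently used
            if len(self.table) > self.max_leaders:
                self.table.popitem(last=False)  # evict the oldest leader

    def lookup(self, leader):
        """Return the followers of a leader, most recently used first."""
        key = tuple(leader)
        followers = self.table.get(key)
        if followers is None:
            return []
        self.table.move_to_end(key)  # a hit refreshes the leader's recency
        return list(reversed(followers))


def build_draft_paths(cache, context, depth=2):
    """Recursively query the cache and return the root-to-leaf draft paths."""
    paths = []

    def expand(prefix, level):
        # The leader is the last leader_len tokens of context plus the draft so far.
        leader = (list(context) + prefix)[-cache.leader_len:]
        followers = cache.lookup(leader) if level < depth else []
        if not followers:
            if prefix:
                paths.append(prefix)
            return
        for follower in followers:
            expand(prefix + list(follower), level + 1)

    expand([], 0)
    return paths


cache = NGramCache(leader_len=1, follower_len=1, max_leaders=2, max_followers=2)
cache.update([1, 2, 3])
print(build_draft_paths(cache, [1]))  # [[2, 3]]
```

In this sketch each root-to-leaf path is one branch of the draft tree. A real system would merge paths that share a prefix into tree nodes, verify them together in one forward pass, and then call `update` on the accepted tokens.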