CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding

Sophia Ho, Jinsol Park, Patrick Wang

TL;DR

CREST addresses the unbounded storage growth of REST's datastore for retrieval-based speculative decoding. It decouples contexts (n-grams) from their continuations by moving them into a dictionary, and it stores only the smallest, most common n-grams, each mapped to a precomputed token tree. This enables disk-native lookup with $O(1)$ (or $O(\log n)$) access and significantly reduces storage while maintaining or improving drafting performance. On MT Bench and HumanEval with Vicuna 7B and CodeLlama 7B, CREST matches REST's accepted length using $10.6$–$13.5\times$ less storage and, at the same storage budget, achieves a $16.5$–$17.1\%$ higher accepted length, all without an additional drafting model. The resulting architecture stays compact as the dataset grows.

Abstract

We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign of REST that allows it to be effectively "compacted". REST is a drafting technique for speculative decoding based on retrieving exact n-gram matches of the most recent n tokens generated by the target LLM from a datastore. The key idea of CREST is to only store a subset of the smallest and most common n-grams in the datastore with the hope of achieving comparable performance with less storage space. We found that storing a subset of n-grams both reduces storage space and improves performance. CREST matches REST's accepted token length with 10.6-13.5x less storage space and achieves a 16.5-17.1% higher acceptance length than REST using the same storage space on the HumanEval and MT Bench benchmarks.
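The core idea in the abstract, keeping only the most common short n-grams and mapping each to its likely continuations, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_compact_datastore` and `draft`, the parameters `keep`, `cont_len`, and `per_key`, and the choice to store flat continuation lists instead of precomputed token trees are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def build_compact_datastore(tokens, max_n=3, keep=1000, cont_len=2, per_key=2):
    """Map the `keep` most frequent n-grams (n <= max_n) to their most
    common continuations. Hypothetical helper, not CREST's actual code."""
    ngram_counts = Counter()
    continuations = defaultdict(Counter)
    for n in range(1, max_n + 1):
        # Stop early enough that every n-gram has at least one following token.
        for i in range(len(tokens) - n):
            key = tuple(tokens[i:i + n])
            cont = tuple(tokens[i + n:i + n + cont_len])
            ngram_counts[key] += 1
            continuations[key][cont] += 1
    # Keep only the most common n-grams; everything else is dropped.
    kept = [key for key, _ in ngram_counts.most_common(keep)]
    return {
        key: [cont for cont, _ in continuations[key].most_common(per_key)]
        for key in kept
    }


def draft(datastore, context, max_n=3):
    """Return candidate continuations for the longest suffix of `context`
    that is present in the datastore, or an empty list if none matches."""
    for n in range(min(max_n, len(context)), 0, -1):
        key = tuple(context[-n:])
        if key in datastore:  # dictionary lookup, O(1) on average
            return datastore[key]
    return []


corpus = "a b c a b d a b c".split()
store = build_compact_datastore(corpus, max_n=2, cont_len=1)
candidates = draft(store, ["x", "a", "b"], max_n=2)
```

In this toy corpus, the context ending in `a b` retrieves the continuation `c` ahead of `d` because `c` follows `a b` more often. Lowering `keep` shrinks the dictionary at the cost of fewer matching contexts, which is the storage trade-off the abstract evaluates.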

Paper Structure (18 sections, 9 figures, 2 tables)

This paper contains 18 sections, 9 figures, 2 tables.

Figures (9)

  • Figure 1: An example suffix array using the token chunk "mlsystems", where each character is a token. Note that in REST, each token is a 32-bit integer and not a character. We made each token a character in this example only for ease of explanation.
  • Figure 2: An example of REST's drafting process. The context is the latest 2 generated tokens, or "machine learning". The exact matches of the context are highlighted in green in the datastore. The continuations are highlighted in orange.
  • Figure 3: Generation speed and average accepted length of REST with different starting context lengths. The graph is lifted directly from the REST paper. The settings are CodeLlama 7B with greedy sampling on HumanEval.
  • Figure 4: Frequency analysis for n-grams up to 5 for the ShareGPT and The Stack dataset.
  • Figure 5: An overview of the CREST system. Our contributions are highlighted in blue. Note that arrows are highlighted in addition to components because many challenges we faced involved integration issues, not just the core CREST components.
  • ...and 4 more figures
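
The suffix-array lookup described in the captions of Figures 1 and 2 can be sketched as follows, using the same "mlsystems" chunk with each character standing in for a token. This is an illustrative sketch under assumptions, not REST's implementation: `build_suffix_array` and `find_matches` are hypothetical names, the construction is a naive sort rather than an efficient algorithm, and real tokens would be 32-bit integers rather than characters.

```python
def build_suffix_array(tokens):
    """Return suffix start positions sorted by the suffix they begin.
    Naive construction, fine for a small illustrative chunk."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])


def find_matches(tokens, suffix_array, pattern):
    """Return sorted start positions of exact matches of `pattern`,
    found by binary search over the suffix array."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:  # lower bound: first suffix whose prefix is >= pattern
        mid = (lo + hi) // 2
        if tokens[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(suffix_array)
    while lo < hi:  # upper bound: first suffix whose prefix is > pattern
        mid = (lo + hi) // 2
        if tokens[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])


chunk = "mlsystems"
sa = build_suffix_array(chunk)
positions = find_matches(chunk, sa, "s")
# The tokens that follow each match are the candidate continuations.
continuations = [chunk[p + 1:p + 3] for p in positions]
```

Here the context `s` matches at positions 2, 4, and 8, and the tokens that follow each match serve as draft candidates. The last match sits at the end of the chunk, so its continuation is empty.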