CREST: Effectively Compacting a Datastore For Retrieval-Based Speculative Decoding
Sophia Ho, Jinsol Park, Patrick Wang
TL;DR
CREST addresses the unbounded storage growth of REST's datastore for retrieval-based speculative decoding. It decouples contexts (n-grams) from their continuations, keeps the contexts in a dictionary, and stores only the smallest, most common n-grams, each mapped to a precomputed token tree. This allows lookups to run directly on disk with $O(1)$ (or $O(\log n)$) access and substantially reduces storage while maintaining or improving drafting performance. On MT Bench and HumanEval with Vicuna 7B and CodeLlama 7B, CREST matches REST's accepted length using $10.6$–$13.5\times$ less storage, and achieves a $16.5$–$17.1\%$ higher accepted length than REST at the same storage budget, without requiring an additional drafting model. Overall, CREST offers a compact datastore design that scales with dataset size.
Abstract
We present CREST (Compact Retrieval-Based Speculative Decoding), a redesign of REST that allows it to be effectively "compacted". REST is a drafting technique for speculative decoding based on retrieving, from a datastore, exact n-gram matches of the most recent n tokens generated by the target LLM. The key idea of CREST is to store only a subset of the smallest and most common n-grams in the datastore, in the hope of achieving comparable performance with less storage space. We found that storing a subset of n-grams both reduces storage space and improves performance. CREST matches REST's accepted length with 10.6-13.5x less storage space and achieves a 16.5-17.1% higher accepted length than REST using the same storage space on the HumanEval and MT Bench benchmarks.
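To make the retrieval idea concrete, the following is a minimal Python sketch of an n-gram datastore that keeps only a fixed budget of contexts and drafts from the longest matching suffix. It is an illustration under stated assumptions, not the CREST implementation: the function names `build_datastore` and `draft`, the parameters `max_n`, `cont_len`, and `budget`, and the ranking rule (shortest contexts first, then most frequent) are all hypothetical, and each context maps to its single most frequent continuation rather than to a token tree.

```python
from collections import Counter, defaultdict


def build_datastore(corpus, max_n=3, cont_len=2, budget=4):
    """Map context n-grams to continuation counts, keeping only `budget` contexts.

    Assumption: "smallest and most common" is modeled by ranking contexts
    shortest first, then by descending frequency. The actual selection rule
    may differ.
    """
    continuations = defaultdict(Counter)
    frequency = Counter()
    for seq in corpus:
        for n in range(1, max_n + 1):
            # Only contexts that are followed by at least one token.
            for i in range(len(seq) - n):
                ctx = tuple(seq[i:i + n])
                cont = tuple(seq[i + n:i + n + cont_len])
                continuations[ctx][cont] += 1
                frequency[ctx] += 1
    # Final tuple element makes tie-breaking deterministic.
    ranked = sorted(frequency, key=lambda c: (len(c), -frequency[c], c))
    return {ctx: continuations[ctx] for ctx in ranked[:budget]}


def draft(datastore, generated, max_n=3):
    """Return draft tokens for the longest stored suffix of `generated`."""
    for n in range(min(max_n, len(generated)), 0, -1):
        ctx = tuple(generated[-n:])
        if ctx in datastore:
            # Simplification: take the single most frequent continuation.
            return list(datastore[ctx].most_common(1)[0][0])
    return []  # no match: fall back to ordinary decoding


# Toy token-ID corpus, for illustration only.
corpus = [[1, 2, 3, 4], [1, 2, 3, 5], [1, 2, 3, 4]]
datastore = build_datastore(corpus, max_n=2, cont_len=2, budget=4)
proposal = draft(datastore, [9, 1, 2], max_n=2)
```

In this sketch the `budget` parameter plays the role of the storage limit: shrinking it discards longer or rarer contexts first, so a lookup falls back to a shorter suffix when the longer one is not stored.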
