
Turning Trash into Treasure: Accelerating Inference of Large Language Models with Token Recycling

Xianzhen Luo, Yixuan Wang, Qingfu Zhu, Zhiming Zhang, Xuanyu Zhang, Qing Yang, Dongliang Xu

TL;DR

Token Recycling (TR) caches candidate tokens in an adjacency matrix, retrieves a draft tree from that matrix with a BFS-like algorithm, and verifies the tree with tree attention, accelerating autoregressive LLM inference without any training. It continuously updates the retrieval space with new candidates during decoding, achieving roughly 2x speedups across model sizes while using less than 2MB of extra storage. The approach outperforms prior train-free methods by about 30% and even competes with training-based methods, with the largest gains on high-redundancy and code-domain tasks. TR's plug-and-play design, low memory footprint, and robust performance across tasks suggest wide practical impact for real-time LLM applications.

Abstract

Massive parameters of LLMs have made inference latency a fundamental bottleneck. Speculative decoding represents a lossless approach to accelerate inference through a guess-and-verify paradigm. Some methods rely on additional architectures to guess draft tokens, which need extra training before use. Alternatively, retrieval-based training-free techniques build libraries from pre-existing corpora or by n-gram generation. However, they face challenges like large storage requirements, time-consuming retrieval, and limited adaptability. Observing that candidate tokens generated during the decoding process are likely to reoccur in future sequences, we propose Token Recycling. It stores candidate tokens in an adjacency matrix and employs a breadth-first-search (BFS)-like algorithm to construct a draft tree, which is then validated through tree attention. New candidate tokens from the decoding process are then used to update the matrix. Token Recycling requires <2MB of additional storage and achieves approximately 2x speedup across all sizes of LLMs. It significantly outperforms existing train-free methods by 30% and even a widely recognized training method by 25%.
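The store-retrieve-update loop described in the abstract can be illustrated with a small sketch. This is a toy model under stated assumptions, not the authors' implementation: the class name `TokenRecycler`, the parameters `k`, `depth`, and `breadth`, the `-1` empty-slot marker, and the plain-list representation are all hypothetical. In particular, the sketch expands every node uniformly, whereas the paper's retrieval is only described here as "BFS-like".

```python
from collections import deque


class TokenRecycler:
    """Toy adjacency-matrix cache: row t holds k candidate successors of token t.

    Hypothetical illustration; names and layout are assumptions.
    """

    def __init__(self, vocab_size, k):
        self.k = k
        # -1 marks an empty slot (no candidate recorded yet).
        self.matrix = [[-1] * k for _ in range(vocab_size)]

    def update(self, token, candidates):
        """Overwrite the row of `token` with its newest top-k candidates."""
        row = list(candidates)[: self.k]
        self.matrix[token] = row + [-1] * (self.k - len(row))

    def draft_tree(self, root, depth, breadth):
        """Expand a draft tree from `root` in breadth-first order.

        Returns (tokens, parents): tokens[i] is the token at node i and
        parents[i] is the index of its parent node (-1 for the root).
        """
        tokens, parents = [root], [-1]
        queue = deque([(0, 0)])  # (node index, level)
        while queue:
            idx, level = queue.popleft()
            if level == depth:
                continue  # do not expand below the requested depth
            for cand in self.matrix[tokens[idx]][:breadth]:
                if cand == -1:
                    continue
                tokens.append(cand)
                parents.append(idx)
                queue.append((len(tokens) - 1, level + 1))
        return tokens, parents


# Usage: record candidates seen during earlier steps, then draft from token 1.
cache = TokenRecycler(vocab_size=10, k=2)
cache.update(1, [2, 3])
cache.update(2, [4])
cache.update(3, [5, 6])
tokens, parents = cache.draft_tree(root=1, depth=2, breadth=2)
print(tokens)   # [1, 2, 3, 4, 5, 6]
print(parents)  # [-1, 0, 0, 1, 2, 2]
```

Because each row is simply overwritten, the cache stays at a fixed size of `vocab_size * k` entries regardless of how much text has been generated, which is consistent with the small, constant storage cost the abstract reports.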

Paper Structure (34 sections, 9 equations, 6 figures, 6 tables, 1 algorithm)

This paper contains 34 sections, 9 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: A comparison of typical speculative decoding and Token Recycling (TR). Typical methods draft some tokens and verify them in parallel in one decoding step. Unlike other methods that discard candidate tokens, TR stores them in an adjacency matrix. In future generations, draft tokens are retrieved from the matrix which is updated with new candidate tokens. TR effectively recycles tokens in the decoding process.
  • Figure 2: An overview of Token Recycling (TR). The adjacency matrix, initialized by the existing matrix, stores candidate tokens. TR first retrieves a draft tree from the matrix, which is then verified through tree attention. After adding the longest correct sequence to the content, the new top-k candidate tokens update the matrix.
  • Figure 3: Effects of (a) tree breadth, (b) tree depth, and (c) updating strategies on MAT and Tokens/s.
  • Figure 4: MAT and Speedup ratio under different temperatures during generation.
  • Figure 5: Time allocation for each operation when LLMs respond to a query.
  • ...and 1 more figure
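The Figure 2 caption mentions verifying a draft tree through tree attention and then keeping the longest correct sequence. A minimal sketch of those two steps follows. It assumes a draft tree given as a `parents` array, greedy verification, and a hypothetical `predicted_next` list standing in for the model's output at each node; none of these names come from the paper.

```python
def tree_attention_mask(parents):
    """mask[i][j] is True when node j is node i itself or one of its ancestors.

    Each draft node may attend only to its own path back to the root, so all
    branches of the tree can be scored in a single forward pass.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


def longest_accepted_path(tokens, parents, predicted_next):
    """Follow the tree from the root while a child matches the model's prediction.

    predicted_next[i] is the token the model predicts after node i
    (a hypothetical stand-in for the verifier's greedy output).
    """
    children = {}
    for idx, parent in enumerate(parents):
        children.setdefault(parent, []).append(idx)
    path, node = [0], 0
    while True:
        match = next(
            (c for c in children.get(node, []) if tokens[c] == predicted_next[node]),
            None,
        )
        if match is None:
            return path
        path.append(match)
        node = match


# Draft tree: root 1 with children 2 and 3; 2 -> 4; 3 -> 5 and 6.
tokens = [1, 2, 3, 4, 5, 6]
parents = [-1, 0, 0, 1, 2, 2]
mask = tree_attention_mask(parents)
print(mask[5])  # [True, False, True, False, False, True]

predicted_next = [3, 0, 6, 0, 0, 7]  # made-up verifier output for illustration
path = longest_accepted_path(tokens, parents, predicted_next)
print([tokens[i] for i in path])  # [1, 3, 6]
```

In this toy run the draft tokens 3 and 6 are accepted, and the model's prediction at the last accepted node (7 here) would be appended as the next token, so one decoding step yields three new tokens instead of one.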