Adaptive Draft-Verification for Efficient Large Language Model Decoding

Xukun Liu, Bowen Lei, Ruqi Zhang, Dongkuan Xu

TL;DR

This paper tackles the latency bottleneck in autoregressive LLM decoding by introducing ADED, a fine-tuning-free framework that adaptively constructs and verifies drafts to align with the true LLM output distribution. Central to ADED are a tri-gram matrix-based LLM representative that evolves during decoding and an MCTS-inspired draft maker that balances exploration and exploitation, guided by a PUCT-based scoring mechanism. The draft-verification loop continuously updates the tri-gram representation, enabling self-improvement and reduced latency without retraining, while tree attention ensures drafts remain faithful to autoregressive behavior. Empirical results across diverse models and benchmarks show up to 2.5x speedups and improved acceptance rates, with lower memory footprints and robust performance across tasks, making it well-suited for latency-sensitive and edge deployments.
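
As a rough illustration of the PUCT-based scoring mentioned above, the sketch below uses the AlphaZero-style form of PUCT: a value estimate plus a prior-weighted exploration bonus that shrinks as a child is visited more often. The exact formula, constants, and data layout used by ADED are not given in this summary, so `puct_score`, `select_child`, `c_puct`, and the per-child dictionaries are all illustrative assumptions.

```python
import math


def puct_score(q_value, prior, parent_visits, child_visits, c_puct=1.0):
    """AlphaZero-style PUCT (assumed form, not necessarily ADED's exact rule).

    q_value: current value estimate of the child (exploitation term).
    prior: prior probability of the child, e.g. from tri-gram statistics.
    """
    exploration = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q_value + exploration


def select_child(children, c_puct=1.0):
    """Return the token whose child node has the highest PUCT score."""
    parent_visits = sum(child["visits"] for child in children.values())
    return max(
        children,
        key=lambda tok: puct_score(
            children[tok]["q"],
            children[tok]["prior"],
            parent_visits,
            children[tok]["visits"],
            c_puct,
        ),
    )


# Hypothetical candidate tokens: "the" is well explored, "a" is unvisited.
children = {
    "the": {"q": 0.6, "prior": 0.5, "visits": 10},
    "a": {"q": 0.5, "prior": 0.4, "visits": 0},
}
best = select_child(children)
```

In this toy setting the unvisited token wins despite its lower value estimate, because its exploration bonus has not yet been discounted by visits. That is the exploration-exploitation trade-off the draft maker relies on.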

Abstract

Large language model (LLM) decoding involves generating a sequence of tokens based on a given context, where each token is predicted one at a time using the model's learned probabilities. The typical autoregressive decoding method requires a separate forward pass through the model for each token generated, which is computationally inefficient and poses challenges for deploying LLMs in latency-sensitive scenarios. The main limitations of current decoding methods stem from their inefficiencies and resource demands. Existing approaches either necessitate fine-tuning smaller models, which is resource-intensive, or rely on fixed retrieval schemes to construct drafts for the next tokens, which lack adaptability and fail to generalize across different models and contexts. To address these issues, we introduce a novel methodology called ADED, which accelerates LLM decoding without requiring fine-tuning. Our approach involves an adaptive draft-verification process that evolves over time to improve efficiency. We utilize a tri-gram matrix-based LLM representation to dynamically approximate the output distribution of the LLM, allowing the model to adjust to changing token probabilities during the decoding process. Additionally, we implement a draft construction mechanism that effectively balances exploration and exploitation, ensuring that the drafts generated are both diverse and close to the true output distribution of the LLM. The importance of this design lies in its ability to optimize the draft distribution adaptively, leading to faster and more accurate decoding. Through extensive experiments on various benchmark datasets and LLM architectures, we demonstrate that ADED significantly accelerates the decoding process while maintaining high accuracy, making it suitable for deployment in a wide range of practical applications.
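
To make the tri-gram idea concrete, here is a minimal sketch of a count table that is updated from observed token sequences and queried for an empirical next-token distribution given the two preceding tokens. The paper describes a tri-gram matrix; the dictionary-of-counters layout and the names `TrigramTable`, `update`, and `next_token_probs` are assumptions chosen for brevity, not the authors' implementation.

```python
from collections import Counter, defaultdict


class TrigramTable:
    """Sparse tri-gram counts: (token_a, token_b) -> Counter of next tokens."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, tokens):
        """Add every consecutive token triple in `tokens` to the counts."""
        for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
            self.counts[(a, b)][c] += 1

    def next_token_probs(self, prev2, prev1):
        """Empirical distribution over the token following (prev2, prev1)."""
        followers = self.counts.get((prev2, prev1))
        if not followers:
            return {}
        total = sum(followers.values())
        return {tok: n / total for tok, n in followers.items()}


table = TrigramTable()
table.update([1, 2, 3, 1, 2, 4, 1, 2, 3])  # toy token ids
probs = table.next_token_probs(1, 2)
```

Calling `update` again with newly generated tokens shifts the estimated distribution, which is one simple way such a table could track the model's output during decoding.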

Paper Structure

This paper contains 34 sections, 17 equations, 9 figures, 3 tables, and 2 algorithms.

Figures (9)

  • Figure 1: Comparison of different LLM decoding strategies. In Speculative Decoding, a small LLM generates predictions (red blocks) from inputs (blue blocks). Yellow blocks indicate intermediate results obtained from the language model. Lookahead uses a large LLM for forward-looking predictions. REST employs a corpus trie for rapid token lookups. ADED integrates Monte Carlo Tree Search with tri-gram statistics and recent token history to simulate potential outputs, refining its recommendations over time. By continuously evolving its draft constructions, ADED's adaptive approach offers advantages in speed and accuracy.
  • Figure 2: Data processing workflow of ADED. First, the input tokens are preprocessed to compute their tri-grams, which are used to update the tri-gram matrix. Next, the updated matrix, together with the last two tokens of the input, is used to retrieve candidate token sequences. These sequences are ranked, and the top-k are selected and appended to the original input. Finally, the extended sequences are fed into the large language model for prediction.
  • Figure 3: Comparison of ADED's throughput for different models on (a) MT-Bench, (b) Alpaca, and (c) Human-Eval. The performance of ADED shows stable and significant improvements across different models and benchmarks.
  • Figure 4: (a) Adaptive strategy comparison on MT-Bench: performance of the Vicuna-7B model with and without the adaptive strategy, showing the advantage of the adaptive approach. (b) Average accept length for different tasks on MT-Bench, demonstrating that ADED performs consistently well across tasks.
  • Figure 5: Sensitivity analysis of ADED on (a) top-p and (b) temperature parameters.
  • ...and 4 more figures
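
The retrieval and ranking steps described in the Figure 2 caption can be sketched as follows. This is a simplified stand-in that expands every continuation level by level and ranks drafts by their tri-gram log-probability. It is not the MCTS-based draft maker from the paper, and `top_k_drafts`, `depth`, `k`, and the toy counts are hypothetical.

```python
import heapq
import math


def top_k_drafts(trigram_counts, context, depth=3, k=2):
    """Retrieve continuations from the last two context tokens, keep the k best.

    trigram_counts maps (token_a, token_b) -> {next_token: count}.
    Drafts are ranked by summed log-probability under the tri-gram counts.
    """
    if len(context) < 2:
        return []
    # Each frontier entry: (log-prob so far, last two tokens, draft tokens).
    frontier = [(0.0, tuple(context[-2:]), ())]
    finished = []
    for _ in range(depth):
        next_frontier = []
        for logp, (a, b), draft in frontier:
            followers = trigram_counts.get((a, b), {})
            total = sum(followers.values())
            if total == 0:
                # No known continuation: this draft stops here.
                finished.append((logp, draft))
                continue
            for tok, n in followers.items():
                next_frontier.append(
                    (logp + math.log(n / total), (b, tok), draft + (tok,))
                )
        frontier = next_frontier
    finished.extend((logp, draft) for logp, _, draft in frontier)
    candidates = [(logp, draft) for logp, draft in finished if draft]
    best = heapq.nlargest(k, candidates, key=lambda item: item[0])
    return [list(draft) for _, draft in best]


# Toy tri-gram counts over integer token ids.
counts = {(1, 2): {3: 3, 4: 1}, (2, 3): {5: 1}, (2, 4): {6: 1}}
context = [0, 1, 2]
drafts = top_k_drafts(counts, context, depth=2, k=2)
# Append each selected draft to the original input before verification.
extended_inputs = [context + draft for draft in drafts]
```

Exhaustive expansion is only practical for very small depths. A search guided by visit counts and priors avoids enumerating every branch, which is presumably why the paper uses an MCTS-inspired procedure instead.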