Decoding at the Speed of Thought: Harnessing Parallel Decoding of Lexical Units for LLMs

Chenxi Sun, Hongzhi Zhang, Zijia Lin, Jingyuan Zhang, Fuzheng Zhang, Zhongyuan Wang, Bin Chen, Chengru Song, Di Zhang, Kun Gai, Deyi Xiong

TL;DR

Lexical Unit Decoding (LUD) tackles the decoding bottleneck in decoder-only LLMs by identifying high-confidence lexical units (spans of contiguous tokens that can be predicted in parallel) and training the model to emit multiple tokens at once without architectural changes. The approach combines a look-ahead inference strategy, adaptive span acceptance, and repetition control with a data-centric pipeline that marks lexical units and reconfigures training data to foster parallel decoding. Experiments report a 33% speedup on text generation with no quality loss and a 30% speedup on code generation with about 3% quality loss. Analyses show that the tokens decoded in parallel are coherent and that parallel span lengths follow a predictable distribution. The method is complementary to existing acceleration techniques and suggests a new decoding paradigm for future LLMs; the code is publicly released for reproduction and extension.

Abstract

Large language models have demonstrated exceptional capability in natural language understanding and generation. However, their generation speed is limited by the inherently sequential nature of their decoding process, posing challenges for real-time applications. This paper introduces Lexical Unit Decoding (LUD), a novel decoding methodology implemented in a data-driven manner, accelerating the decoding process without sacrificing output quality. The core of our approach is the observation that a pre-trained language model can confidently predict multiple contiguous tokens, forming the basis for a "lexical unit", in which these contiguous tokens can be decoded in parallel. Extensive experiments validate that our method substantially reduces decoding time while maintaining generation quality, i.e., a 33% speedup on natural language generation with no quality loss, and a 30% speedup on code generation with a negligible quality loss of 3%. Distinctively, LUD requires no auxiliary models and no changes to existing architectures. It can also be integrated with other decoding acceleration methods, thus achieving an even more pronounced inference efficiency boost. We posit that the foundational principles of LUD could define a new decoding paradigm for future language models, enhancing their applicability for a broader spectrum of applications. All code is publicly available at https://github.com/tjunlp-lab/Lexical-Unit-Decoding-LUD-.

Keywords: Parallel Decoding, Lexical Unit Decoding, Large Language Model

Paper Structure (39 sections, 8 equations, 7 figures, 1 table, 1 algorithm)

Figures (7)

  • Figure 1: Illustration of "lexical units" as consecutive token spans. These units, as conceptualized in our study, can potentially be identified and decoded in parallel, enhancing the decoding speed of LLMs.
  • Figure 2: Illustration of the lexical unit decoding procedure. In the decoding process, we look ahead $k=5$ tokens by appending $k-1=4$ tokens and retrieving the last $k$ predicted tokens with their probabilities. However, we only accept consecutive tokens with probabilities larger than $\alpha=0.9$.
  • Figure 3: Visualization of the streamlined data generation process. Given a sequence of tokens and their corresponding probabilities, lexical units are segmented based on a threshold $\alpha=0.9$. Probabilities above the threshold are highlighted in green. A lexical unit can consist of a single token or of multiple tokens. Multi-token lexical units are appended with [PAD] tokens to enable the training of parallel decoding. For individual tokens with lower prediction confidence, the model falls back to standard auto-regressive training to maintain output quality. In practice, the second and third instances, each a lexical unit of length $1$, are combined into one, which is essentially the same as a standard auto-regressive training example.
  • Figure 4: Quality and Acceleration curves of text generation
  • Figure 5: Quality and Acceleration curves of code generation
  • ...and 2 more figures
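
The captions of Figures 2 and 3 describe two threshold rules that are easy to misread in prose, so a minimal Python sketch of both follows. It is an illustration under stated assumptions, not the authors' implementation. The function names (accept_span, segment_units, pads_for_unit) are hypothetical, the toy tokens and probabilities are made up, and no language model is called. Three further assumptions go beyond what the captions state: the first look-ahead token is always kept so that every step advances by at least one token, a token whose probability does not exceed the threshold starts a new lexical unit, and a unit of n tokens is paired with n - 1 [PAD] placeholders, mirroring the k - 1 appended tokens in Figure 2.

```python
from typing import List, Sequence

ALPHA = 0.9    # confidence threshold quoted in the Figure 2 and Figure 3 captions
PAD = "[PAD]"  # placeholder token named in the Figure 3 caption


def accept_span(tokens: Sequence[str], probs: Sequence[float],
                alpha: float = ALPHA) -> List[str]:
    """Figure 2 rule: keep the leading run of look-ahead tokens above alpha.

    Assumption: the first token is always kept, so each decoding step
    emits at least one token, as in ordinary auto-regressive decoding.
    """
    if not tokens:
        return []
    accepted = [tokens[0]]
    for tok, p in zip(tokens[1:], probs[1:]):
        if p <= alpha:
            break  # acceptance stops at the first low-confidence token
        accepted.append(tok)
    return accepted


def segment_units(tokens: Sequence[str], probs: Sequence[float],
                  alpha: float = ALPHA) -> List[List[str]]:
    """Figure 3 rule: split a token sequence into lexical units.

    Assumption: a token above alpha joins the current unit; any other
    token opens a new unit.
    """
    units: List[List[str]] = []
    for tok, p in zip(tokens, probs):
        if units and p > alpha:
            units[-1].append(tok)
        else:
            units.append([tok])
    return units


def pads_for_unit(unit: Sequence[str], pad: str = PAD) -> List[str]:
    """Assumption: n - 1 placeholders for a unit of n tokens (none for n = 1)."""
    return [pad] * (len(unit) - 1)


if __name__ == "__main__":
    # Toy look-ahead step with k = 5 candidate tokens (made-up values).
    print(accept_span(["a", "b", "c", "d", "e"], [0.5, 0.95, 0.92, 0.6, 0.99]))

    # Toy training sequence with made-up per-token probabilities.
    toks = ["The", "Eiffel", "Tower", "is", "in", "Paris"]
    prbs = [0.40, 0.60, 0.97, 0.95, 0.50, 0.93]
    for unit in segment_units(toks, prbs):
        print(unit, pads_for_unit(unit))
```

In the look-ahead example the fifth token has probability 0.99 but is still rejected, because acceptance stops at the first token that fails the threshold; only a consecutive prefix is kept.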