DynaMo: Accelerating Language Model Inference with Dynamic Multi-Token Sampling

Shikhar Tuli, Chi-Heng Lin, Yen-Chang Hsu, Niraj K. Jha, Yilin Shen, Hongxia Jin

TL;DR

DynaMo is a dynamic multi-token prediction framework that accelerates language model inference by jointly predicting multiple tokens and backing off to lower-order predictions when confidence is low. It introduces a modified CLM objective, co-occurrence weighted masking, and adaptive thresholding to approximate and refine the joint token distribution, and it reuses the weights of autoregressive baselines to keep training overhead small. DynaMo-7.3B-T3 matches the generation quality of its Pythia-6.9B baseline at a 2.57× speed-up, with 5.87% parameter overhead and 2.67% training-time overhead; the models also improve downstream task performance on several benchmarks. Evaluation covers NLU benchmarks, multi-token perplexity, and open-ended generation, and the approach offers a practical path toward faster, edge-friendly LLMs without sacrificing generation quality.

Abstract

Traditional language models operate autoregressively, i.e., they predict one token at a time. Rapid explosion in model sizes has resulted in high inference times. In this work, we propose DynaMo, a suite of multi-token prediction language models that reduce net inference times. Our models $\textit{dynamically}$ predict multiple tokens based on their confidence in the predicted joint probability distribution. We propose a lightweight technique to train these models, leveraging the weights of traditional autoregressive counterparts. Moreover, we propose novel ways to enhance the estimated joint probability to improve text generation quality, namely co-occurrence weighted masking and adaptive thresholding. We also propose systematic qualitative and quantitative methods to rigorously test the quality of generated text for non-autoregressive generation. One of the models in our suite, DynaMo-7.3B-T3, achieves same-quality generated text as the baseline (Pythia-6.9B) while achieving 2.57$\times$ speed-up with only 5.87% and 2.67% parameter and training time overheads, respectively.
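
To make the idea of confidence-based back-off concrete, here is a minimal sketch of a single decoding step. It is not the authors' implementation: it assumes greedy selection, approximates the joint probability as a plain product of per-position probabilities (without co-occurrence weighted masking or adaptive thresholding), and uses a single hypothetical threshold `eps_b`. The function name `dynamic_multi_token_step` and the input layout are illustrative only.

```python
import numpy as np

def dynamic_multi_token_step(head_probs, eps_b):
    """Emit between 1 and len(head_probs) tokens from one forward pass.

    head_probs: list of 1-D arrays; head_probs[i] is an (assumed) predicted
                distribution over the vocabulary for position t+i+1.
    eps_b:      hypothetical back-off threshold on the estimated joint
                probability of the tokens emitted so far.
    """
    # The first token is always emitted, as in ordinary autoregressive decoding.
    tokens = [int(np.argmax(head_probs[0]))]
    joint = float(head_probs[0][tokens[0]])

    for probs in head_probs[1:]:
        candidate = int(np.argmax(probs))
        # Assumption: independence approximation of the joint probability.
        joint *= float(probs[candidate])
        if joint < eps_b:
            # Confidence too low: back off to the lower-order prediction.
            break
        tokens.append(candidate)
    return tokens

# Toy example with a two-word vocabulary and three prediction positions.
heads = [np.array([0.1, 0.9]), np.array([0.8, 0.2]), np.array([0.5, 0.5])]
emitted = dynamic_multi_token_step(heads, eps_b=0.5)
```

In this toy case the first two positions are confident enough to be emitted together, while the third falls below the threshold, so the step backs off from a trigram to a bigram. Raising `eps_b` makes the model behave more like a standard one-token-at-a-time decoder; lowering it emits more tokens per pass.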

Paper Structure

This paper contains 46 sections, 1 theorem, 8 equations, 24 figures, 13 tables, 1 algorithm.

Key Result

Theorem 1

When the cost function $c({\mathbf{x}}_{t+1}, {\mathbf{x}}_{t+2}, \ldots, {\mathbf{x}}_{t+n}) = -\log ( \frac{ \hat{p}({\mathbf{x}}_{t+1:t+n}) }{ \prod_{i=1}^n \hat{p}({\mathbf{x}}_{t+i}) } )$ and $\epsilon_2 = 0$ [defined in Eq. (eq:optimal_transport)], the joint probability distribution in Eq. (eq

Figures (24)

  • Figure 1: Multi-token prediction in DynaMo. (a) Traditional autoregressive prediction requires three forward passes. (b) Non-autoregressive multi-token prediction requires only one forward pass.
  • Figure 2: Flowchart of the proposed dynamic multi-token prediction pipeline.
  • Figure 3: Win rate vs. speed-up for pairwise comparisons on the sentence-completion benchmark with corresponding Pythia models as baselines. GPT-3.5 is used as a judge. Regression plotted with 95% confidence intervals. Same-quality speed-ups are shown in parentheses. Theoretical same-quality speed-ups are marked with an asterisk (*).
  • Figure 4: Pairwise performance of the DynaMo and Pythia models on the Vicuna benchmark. GPT-4 was used as a judge. The actual number of wins, ties, and losses are colored green, yellow, and red, respectively.
  • Figure 5: Percentage of unigram, bigram, and trigram generations vs. $\epsilon_b$ for DynaMo-70M-T3.
  • ...and 19 more figures
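
The mix of unigram, bigram, and trigram generations (as plotted in Figure 5) is what drives the speed-up. As a rough illustration, and assuming each forward pass costs about the same as a baseline pass and that the fractions are measured per forward pass, the average number of tokens emitted per pass gives an upper-bound estimate of the speed-up. The fractions below are made-up placeholders, not values from the paper, and `expected_tokens_per_pass` is a hypothetical helper.

```python
def expected_tokens_per_pass(fractions):
    """Average tokens emitted per forward pass.

    fractions: dict mapping n (tokens emitted in a pass) to the fraction of
               forward passes that emitted exactly n tokens. Must sum to 1.
    """
    total = sum(fractions.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(n * share for n, share in fractions.items())

# Placeholder mix: 50% unigram, 30% bigram, 20% trigram passes.
mix = {1: 0.5, 2: 0.3, 3: 0.2}
tokens_per_pass = expected_tokens_per_pass(mix)  # roughly 1.7 tokens per pass
```

Measured speed-ups would be lower than this estimate once the extra cost of the additional prediction heads is included.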

Theorems & Definitions (2)

  • Theorem 1
  • Proof of Theorem 1