Self-Distillation for Multi-Token Prediction

Guoliang Zhao, Ruobing Xie, An Wang, Shuaipeng Li, Huaibing Xie, Xingwu Sun

Abstract

As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) can accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulty in jointly training multiple MTP heads. We therefore propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head acceptance rates (+7.5%) while preserving main-head performance as much as possible. We also introduce a looped extension strategy for MTP-D, which enables effective and economical extension of MTP heads and yields a further significant inference speedup relative to 1-head MTP (+220.4%). Moreover, we systematically explore and validate key insights on the distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that MTP-D and the looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical use of MTP in LLMs.

Paper Structure

This paper contains 25 sections, 10 equations, 14 figures, 8 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of the gradient-detached, Top-$N$-logits-selected self-distillation method.
  • Figure 2: Illustration of the training strategy for looped extension of MTP-D. The gray blocks represent the frozen main model and the trained MTP heads from 1 to $m$. The weights of the MTP heads from 1 to $m$ are copied to initialize the MTP heads from $m{+}1$ to $2m$. The orange blocks denote the trainable MTP heads.
  • Figure 3: Acceptance rate and cumulative acceptance rate of MTP heads for different models under training-free looped extension. Using the AGIEval-en benchmark as an example, (a) shows the cumulative acceptance rate as the loop is extended to 8 MTP heads for different models, while (b) presents the acceptance rate of each MTP head.
  • Figure 4: Speedup ratios of the 2B Dense model under different MTP methods and $K$ settings across multiple benchmarks, where the inference speed of 1-head MTP serves as the baseline.
  • Figure 5: Comparison of performance across different looped extension strategies with up to 8 loops. The 1-head MTP serves as the baseline.
  • ...and 9 more figures
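
The initialization step described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `Head` class, its `weights` and `trainable` fields, and the `looped_extend` function are hypothetical names introduced here, and plain Python dictionaries stand in for real model parameters. The frozen main model is not modeled at all.

```python
import copy
from dataclasses import dataclass


@dataclass
class Head:
    """Hypothetical stand-in for one MTP head: named weights plus a trainable flag."""
    weights: dict
    trainable: bool = True


def looped_extend(heads):
    """Extend m trained MTP heads to 2m heads (one loop of the extension).

    Following the Figure 2 caption: heads 1..m are frozen, and heads
    m+1..2m are initialized as copies of heads 1..m and left trainable.
    Note that the input heads are modified in place (marked as frozen).
    """
    m = len(heads)
    for head in heads:
        head.trainable = False  # trained heads 1..m stay frozen
    # Deep-copy so later updates to the new heads do not touch the frozen ones.
    new_heads = [
        Head(weights=copy.deepcopy(heads[i].weights), trainable=True)
        for i in range(m)
    ]
    return heads + new_heads


trained = [Head({"w": [0.1, 0.2]}), Head({"w": [0.3, 0.4]})]
extended = looped_extend(trained)
print([h.trainable for h in extended])
```

In this sketch, head $m{+}i$ starts from the weights of head $i$, so repeating the call would give the "up to 8 loops" style of growth mentioned in the Figure 5 caption only under the assumption that each loop doubles the head count; the paper may define a loop differently.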