Self-Distillation for Multi-Token Prediction

Guoliang Zhao; Ruobing Xie; An Wang; Shuaipeng Li; Huaibing Xie; Xingwu Sun

Abstract

As Large Language Models (LLMs) scale up, inference efficiency becomes a critical bottleneck. Multi-Token Prediction (MTP) can accelerate LLM inference by predicting multiple future tokens in parallel. However, existing MTP approaches still face two challenges: limited acceptance rates of MTP heads, and difficulty in jointly training multiple MTP heads. We therefore propose MTP-D, a simple yet effective self-distillation method with minimal additional training cost, which boosts MTP head acceptance rates (+7.5\%) while preserving main-head performance as much as possible. We also introduce a looped extension strategy for MTP-D, which enables effective and economical extension of MTP heads and yields a further significant inference speedup over 1-head MTP (+220.4\%). Moreover, we systematically explore and validate key insights on distillation strategies and the potential scalability of MTP through extensive experiments on seven benchmarks. These results demonstrate that MTP-D and the looped extension strategy effectively enhance MTP-head performance and inference efficiency, facilitating the practical use of MTP in LLMs.
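
The abstract does not spell out the distillation loss, but the caption of Figure 1 characterizes the method as gradient-detached and Top$N$-logits-selected. The sketch below is one plausible reading of that description, not the authors' implementation. It assumes the main head acts as the teacher and an MTP head as the student, that the top $N$ vocabulary entries are chosen by the teacher's logits, and that the loss is a KL divergence over the renormalized top-$N$ distributions. The function names are hypothetical, and the code is plain Python with no autograd, so "gradient-detached" appears only as the convention that teacher logits are treated as constants.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def top_n_indices(logits, n):
    """Indices of the n largest logits, largest first."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return order[:n]


def top_n_distill_loss(teacher_logits, student_logits, n):
    """KL(teacher || student) restricted to the teacher's top-n tokens.

    Assumption: the teacher is the main head and is treated as a constant
    (the gradient-detached part); only the student would receive gradients
    in a real training framework.
    """
    idx = top_n_indices(teacher_logits, n)
    log_p = log_softmax([teacher_logits[i] for i in idx])  # teacher, renormalized
    log_q = log_softmax([student_logits[i] for i in idx])  # student, renormalized
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


if __name__ == "__main__":
    teacher = [1.0, 3.0, 2.0, -1.0]
    student = [3.0, 1.0, 2.0, -1.0]
    print(top_n_distill_loss(teacher, student, n=2))
```

Restricting the loss to the top $N$ entries keeps the cost per position independent of the vocabulary size. Whether the paper renormalizes over the selected entries, as done here, is an assumption.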

Paper Structure (25 sections, 10 equations, 14 figures, 8 tables, 1 algorithm)

Introduction
Preliminary
Method
Existing Issues of MTP
Self-Distillation for MTP in Pre-Training
Looped MTP Head Extension in Continue Pre-Training
Experiments
Experimental Setup
Main Experiments
Ablation Study and Model Analysis
Results of Looped MTP Extension
Related Work
Conclusion
The Use of LLMs
Description of Model Configurations and Training Details
...and 10 more sections

Figures (14)

Figure 1: Overview of the gradient-detached, Top$N$-logits-selected self-distillation method.
Figure 2: Illustration of the training strategy for looped extension of MTP-D. The gray blocks represent the frozen main model and the trained MTP heads from 1 to $m$. The weights of the MTP heads from 1 to $m$ are copied to initialize the MTP heads from $m{+}1$ to $2m$. The orange blocks denote the trainable MTP heads.
Figure 3: Acceptance rate and cumulative acceptance rate of MTP heads for different models under training-free looped extension. Using the AGIEval-en benchmark as an example, (a) shows the cumulative acceptance rate as the loop is extended to 8 MTP heads for different models, while (b) presents the acceptance rate of each MTP head.
Figure 4: Speedup ratios of the 2B Dense model under different MTP methods and $K$ settings across multiple benchmarks, where the inference speed of 1-head MTP serves as the baseline.
Figure 5: Comparison of performance across different looped extension strategies with up to 8 loops. The 1-head MTP serves as the baseline.
...and 9 more figures
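
The initialization described in the caption of Figure 2 can be sketched in a few lines. This is a minimal illustration, not the authors' code: each MTP head is represented as a hypothetical dictionary with a `weights` entry and a `trainable` flag, and the function name `loop_extend` is an assumption made for this example. The trained heads 1 to $m$ are frozen, and their weights are copied to initialize heads $m{+}1$ to $2m$, which are left trainable.

```python
import copy


def loop_extend(heads):
    """Extend m trained MTP heads to 2m heads, following the Figure 2 caption.

    `heads` is a list of dicts like {"weights": [...], "trainable": bool}.
    This representation is hypothetical; a real model would hold tensors.
    """
    # Heads 1..m: keep the trained weights and freeze them.
    frozen = [{"weights": h["weights"], "trainable": False} for h in heads]
    # Heads m+1..2m: start from independent copies of heads 1..m, trainable.
    copies = [
        {"weights": copy.deepcopy(h["weights"]), "trainable": True}
        for h in heads
    ]
    return frozen + copies


if __name__ == "__main__":
    trained = [
        {"weights": [0.1, 0.2], "trainable": True},
        {"weights": [0.3, 0.4], "trainable": True},
    ]
    extended = loop_extend(trained)
    print([h["trainable"] for h in extended])
```

The deep copy matters: the new heads must be able to diverge from the frozen ones during training, so they cannot share the same weight objects.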
