Pre-Training Curriculum for Multi-Token Prediction in Language Models
Ansar Aynetdinov, Alan Akbik
TL;DR
This work addresses the difficulty small language models (SLMs) have with multi-token prediction (MTP) by introducing curriculum learning strategies that gradually adjust the number of prediction heads during pre-training. The authors propose forward and reverse MTP curricula and evaluate them on 1.3B and 3B decoder-only models, with subword and byte-level tokenizations, on datasets including MiniPile. The forward curriculum improves downstream next-token prediction (NTP) performance and generative quality while preserving self-speculative decoding, whereas the reverse curriculum achieves strong NTP performance and output quality but does not provide inference speedups. The results highlight a practical path to harness MTP for SLMs and point to byte-level models and forward curricula as the most effective combination for balancing accuracy, speed, and quality in pre-training regimes.
Abstract
Multi-token prediction (MTP) is a recently proposed pre-training objective for language models. Rather than predicting only the next token, as in next-token prediction (NTP), MTP predicts the next $k$ tokens at each prediction step, using multiple prediction heads. MTP has shown promise in improving downstream performance, inference speed, and training efficiency, particularly for large models. However, prior work has shown that smaller language models (SLMs) struggle with the MTP objective. To address this, we propose a curriculum learning strategy for MTP training, exploring two variants: a forward curriculum, which gradually increases the complexity of the pre-training objective from NTP to MTP, and a reverse curriculum, which does the opposite. Our experiments show that the forward curriculum enables SLMs to better leverage the MTP objective during pre-training, improving downstream NTP performance and generative output quality, while retaining the benefits of self-speculative decoding. The reverse curriculum achieves stronger NTP performance and output quality, but fails to provide any self-speculative decoding benefits.
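To make the two curricula concrete, the following is a minimal sketch of how the number of active prediction heads might be scheduled over training. The abstract does not specify the schedule, so the equal-length stages used here are an assumption, and the names `active_heads` and `mtp_targets` and their parameters are hypothetical.

```python
def active_heads(step, total_steps, k, direction="forward"):
    """Return how many prediction heads are trained at a given step.

    Assumption: training is split into k equal-length stages. The forward
    curriculum goes from 1 head (NTP) up to k heads (full MTP); the reverse
    curriculum goes from k heads down to 1.
    """
    if direction not in ("forward", "reverse"):
        raise ValueError("direction must be 'forward' or 'reverse'")
    if k < 1 or total_steps < 1 or not 0 <= step < total_steps:
        raise ValueError("need k >= 1 and 0 <= step < total_steps")
    # Index of the current stage, in the range 0 .. k-1.
    stage = min(k - 1, step * k // total_steps)
    return 1 + stage if direction == "forward" else k - stage


def mtp_targets(tokens, t, n_heads):
    """Targets for position t: head i (1-based) predicts tokens[t + i].

    Near the end of the sequence fewer targets than n_heads are returned.
    """
    return tokens[t + 1 : t + 1 + n_heads]


if __name__ == "__main__":
    k, total_steps = 4, 8
    print([active_heads(s, total_steps, k, "forward") for s in range(total_steps)])
    # [1, 1, 2, 2, 3, 3, 4, 4]
    print([active_heads(s, total_steps, k, "reverse") for s in range(total_steps)])
    # [4, 4, 3, 3, 2, 2, 1, 1]
    print(mtp_targets(["a", "b", "c", "d", "e"], t=1, n_heads=2))
    # ['c', 'd']
```

In this sketch both curricula visit the same set of objectives and differ only in order: the forward schedule ends training with all $k$ heads active, while the reverse schedule ends with the NTP head alone.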
