Training LLMs Beyond Next Token Prediction -- Filling the Mutual Information Gap
Chun-Hao Yang, Bo-Han Feng, Tzu-Yuan Lai, Yan Yu Chen, Yin-Kai Dean Huang, Shou-De Lin
TL;DR
This work argues that standard next-token prediction is not universally optimal and proposes training-time token ordering guided by mutual information to select information-rich target tokens. It introduces a deterministic Max($MI(S;t)$) strategy and a reversible preprocessing framework that reorders targets before training, with an MI-based augmentation for text generation. Across arithmetic, multi-label classification, and text generation, the approach yields consistent improvements on both small and large models (e.g., GPT-2, Qwen-Math, Llama), highlighting that token informativeness interacts with task structure and pretraining biases. The findings suggest a practical, language-aware alternative for sequence optimization during pretraining, potentially reducing error accumulation and improving robustness in diverse NLP tasks.
Abstract
Optimizing training performance in large language models (LLMs) remains an essential challenge, particularly improving model performance without increasing computational cost. This work challenges the conventional approach of training LLMs with next-token prediction (NTP), arguing that predicting information-rich tokens during training is a more effective way to train LLMs. We investigate the impact of the proposed solution on three kinds of tasks for LLMs: arithmetic, multi-label classification of text, and natural-language generation. This work offers a principled approach to optimizing LLM training, advancing both model performance and the theoretical understanding of target-token selection strategies.
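
As a rough illustration of the idea of MI-guided target ordering with a reversible preprocessing step, the sketch below ranks target positions by an empirical estimate of $MI(S;t)$ on a toy dataset, reorders each target accordingly, and then restores the original order. It is a minimal sketch under stated assumptions, not the authors' implementation: the plug-in estimator, the position-level scoring, the fixed-length targets, and the names `empirical_mi`, `mi_order`, `reorder`, and `restore` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def empirical_mi(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from paired discrete samples."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # sum over observed pairs of p(x,y) * log( p(x,y) / (p(x) p(y)) )
    return sum(
        (c / n) * math.log(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mi_order(sources, targets):
    """Target positions sorted by decreasing estimated MI with the source.

    Assumes every target has the same length (an illustrative simplification).
    """
    length = len(targets[0])
    scores = [
        empirical_mi(sources, [t[i] for t in targets]) for i in range(length)
    ]
    # sorted() is stable, so tied positions keep their original relative order
    return sorted(range(length), key=lambda i: -scores[i])


def reorder(target, order):
    """Apply the permutation: most informative position first."""
    return [target[i] for i in order]


def restore(reordered, order):
    """Invert the permutation to recover the original token order."""
    original = [None] * len(order)
    for new_pos, old_pos in enumerate(order):
        original[old_pos] = reordered[new_pos]
    return original


# Toy data (hypothetical): position 1 is determined by the source,
# position 0 is constant, and position 2 is independent of the source.
sources = ["a", "a", "b", "b"]
targets = [
    ["x", "0", "p"],
    ["x", "0", "q"],
    ["x", "1", "p"],
    ["x", "1", "q"],
]

order = mi_order(sources, targets)
reordered_targets = [reorder(t, order) for t in targets]
restored_targets = [restore(r, order) for r in reordered_targets]
```

On this toy data the source-determined position is moved to the front, and `restore` recovers every original target exactly, which is what makes the reordering usable as a preprocessing step whose output can be mapped back after generation.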
