Training LLMs Beyond Next Token Prediction -- Filling the Mutual Information Gap

Chun-Hao Yang, Bo-Han Feng, Tzu-Yuan Lai, Yan Yu Chen, Yin-Kai Dean Huang, Shou-De Lin

TL;DR

This work argues that standard next-token prediction is not universally optimal and proposes training-time token ordering guided by mutual information to select information-rich target tokens. It introduces a deterministic Max($MI(S;t)$) strategy and a reversible preprocessing framework that reorders targets before training, with an MI-based augmentation for text generation. Across arithmetic, multi-label classification, and text generation, the approach yields consistent improvements on both small and large models (e.g., GPT-2, Qwen-Math, Llama), highlighting that token informativeness interacts with task structure and pretraining biases. The findings suggest a practical, language-aware alternative for sequence optimization during pretraining, potentially reducing error accumulation and improving robustness in diverse NLP tasks.

Abstract

Optimizing the training of large language models (LLMs) remains an essential challenge, particularly improving model performance without increasing computational cost. This work challenges the conventional approach of training LLMs with next-token prediction (NTP), arguing that predicting information-rich tokens during training is a more effective way to train LLMs. We investigate the impact of the proposed solution on three kinds of LLM tasks: arithmetic, multi-label classification of text, and natural-language generation. This work offers a principled approach to optimizing LLM training, advancing both model performance and the theoretical understanding of target-token selection strategies.

Paper Structure

This paper contains 29 sections, 10 equations, 3 figures, 8 tables, 1 algorithm.

Figures (3)

  • Figure 1: Examples of target sequence rearrangement in 3 tasks. (a) Arithmetic tasks (e.g. $35 \times 07 = 0245$): Reordering the digits of the numerical answer. (b) Multi-label text classification: Determining the prediction order of labels. (c) Text generation: Inserting a selected token at the beginning of each sentence.
  • Figure 2: Training curves of different target token orders with significant differences (not displaying all permutations of target token orders due to complexity). (a) 3-digit addition ${A_1A_2A_3 + B_1B_2B_3 = C_1C_2C_3C_4}$ on NanoGPT (0.09M parameters). (b) 2-digit multiplication ${A_1A_2 \times B_1B_2 = C_1C_2C_3C_4}$ on GPT-2-mini (2.67M parameters). (c) Multi-label classification on Qwen-2.5-1.5B-Instruct using the ToxicComment dataset, whose labels [$toxic, obscene, insult, identity\_hate$] are retained and marked as [${C_1,C_2,C_3,C_4}$]. Each curve represents a distinct token order. (d) Text generation on GPT-2-small (137M parameters) using the WikiText-2 dataset. Abbreviations: Acc = maximal accuracy at a fixed iteration; Rank = the rank of Acc among all permutations of target token orders.
  • Figure 3: Data format for TG with selected words placed between special tokens $\left[\text{START}\right]$ and $\left[\text{END}\right]$.
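
The target rearrangement in Figure 1(a) can be illustrated with a minimal Python sketch of a reversible reordering step for the arithmetic case. This is not the authors' implementation: the function names, the `35x07=` source format, and the example permutations are all hypothetical, chosen only to show that a fixed permutation of answer digits can be applied before training and undone after decoding. How the permutation is chosen (for instance by mutual information) is not covered here.

```python
from typing import Sequence, Tuple


def _check_permutation(order: Sequence[int], length: int) -> None:
    """Raise if `order` is not a permutation of 0..length-1."""
    if sorted(order) != list(range(length)):
        raise ValueError("order must be a permutation of the target positions")


def reorder_target(target: str, order: Sequence[int]) -> str:
    """Rearrange the target so that original position order[i] is emitted i-th."""
    _check_permutation(order, len(target))
    return "".join(target[i] for i in order)


def restore_target(reordered: str, order: Sequence[int]) -> str:
    """Invert reorder_target by putting each character back at its original position."""
    _check_permutation(order, len(reordered))
    restored = [""] * len(reordered)
    for new_pos, old_pos in enumerate(order):
        restored[old_pos] = reordered[new_pos]
    return "".join(restored)


def make_example(a: int, b: int, order: Sequence[int]) -> Tuple[str, str]:
    """Build a (source, reordered target) pair for 2-digit multiplication.

    The textual format is an assumption made for this sketch: operands are
    zero-padded to 2 digits and the product to 4 digits, as in 35 x 07 = 0245.
    """
    source = f"{a:02d}x{b:02d}="
    target = f"{a * b:04d}"
    return source, reorder_target(target, order)


if __name__ == "__main__":
    reverse_order = (3, 2, 1, 0)  # hypothetical choice: least-significant digit first
    source, target = make_example(35, 7, reverse_order)
    print(source, target)
    print(restore_target(target, reverse_order))
```

Because the permutation is fixed per task and stored alongside the data, a model trained on the reordered targets can still be evaluated against the original answers by applying `restore_target` to its output.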