NDP: Next Distribution Prediction as a More Broad Target

Junhao Ruan, Abudukeyumu Abudula, Xinyu Liu, Bei Li, Yinqiao Li, Chenglong Wang, Yuchun Fan, Yuan Ge, Tong Xiao, Jingbo Zhu

TL;DR

This paper argues that next-token prediction (NTP) suffers from a narrow-candidate target problem and limited lookahead, and proposes Next Distribution Prediction (NDP), which replaces one-hot targets with $n$-gram distribution targets learned from world-data statistics. NDP constructs separate $n$-gram tables for the supervised (instruction) and CLM (answer) components and fuses them into a non-one-hot target $D_{mix}$, which aligns training targets more closely with real-world data distributions, as approximated by LLM outputs. Across translation, general tasks, language transfer, and domain adaptation, NDP yields consistent improvements over NTP, with notable gains on medical-domain tasks, and it can unify continued pre-training with fine-tuning without extra online training. The results indicate a promising direction for refining target distributions in LLM training and point to broader opportunities for combining supervised and unsupervised signals in flexible, domain-adaptive training regimes.

Abstract

Large language models (LLMs) trained under the next-token prediction (NTP) paradigm have demonstrated powerful capabilities. However, the existing NTP paradigm has several limitations, particularly related to planned task complications and error propagation during inference. In our work, we extend the critique of NTP, highlighting a further limitation that stems from training with a narrow objective: the prediction of a sub-optimal one-hot distribution. To support this critique, we conducted a pre-experiment that treats the output distribution of powerful LLMs as an efficient compression of world data. By measuring how similar the $n$-gram distribution and the one-hot distribution each are to the LLM output distribution, we observed that the $n$-gram distributions align more closely with the output distribution of LLMs. Based on this insight, we introduce Next Distribution Prediction (NDP), which uses $n$-gram distributions to replace the one-hot targets, enhancing learning without extra online training time. We conducted experiments across translation, general tasks, language transfer, and medical domain adaptation. Compared to NTP, NDP achieves up to +2.97 COMET improvement in translation tasks, +0.61 average improvement in general tasks, and a +10.75 average improvement in the medical domain. This demonstrates the concrete benefits of addressing the target narrowing problem and points to a new direction for future work on improving NTP.
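
The idea of replacing a one-hot target with an $n$-gram distribution can be illustrated with a minimal sketch. The code below is not the authors' implementation: the function names, the toy token ids, the bigram setting ($n=2$), and in particular the convex combination with weight `alpha` used to fuse the supervised and CLM tables are all assumptions made for illustration. It builds the two count tables offline, normalizes the counts for one context into distributions, mixes them, and evaluates a soft-target cross-entropy.

```python
import math
from collections import Counter, defaultdict


def build_ngram_table(sequences, n):
    """Count next-token frequencies for every (n-1)-token context."""
    table = defaultdict(Counter)
    for seq in sequences:
        for i in range(n - 1, len(seq)):
            context = tuple(seq[i - n + 1:i])
            table[context][seq[i]] += 1
    return table


def ngram_distribution(table, context, vocab_size):
    """Normalize the counts of one context into a distribution over the vocabulary."""
    counts = table.get(tuple(context))
    if not counts:
        return None  # unseen context: caller may fall back to a one-hot target
    total = sum(counts.values())
    dist = [0.0] * vocab_size
    for token, count in counts.items():
        dist[token] = count / total
    return dist


def mix_targets(sup_dist, clm_dist, alpha=0.5):
    """Hypothetical fusion rule: convex combination of the two distributions."""
    return [alpha * s + (1.0 - alpha) * c for s, c in zip(sup_dist, clm_dist)]


def soft_cross_entropy(target, log_probs):
    """Cross-entropy between a soft target and the model's log-probabilities."""
    return -sum(t * lp for t, lp in zip(target, log_probs) if t > 0.0)


VOCAB_SIZE = 6
N = 2  # bigram tables, chosen only to keep the toy example small

# Supervised table: counted over question + answer tokens (toy ids).
sup_table = build_ngram_table([[1, 2, 3, 4], [1, 2, 3, 5]], N)
# CLM table: counted over answer tokens only.
clm_table = build_ngram_table([[3, 4], [3, 5], [3, 4]], N)

context = (3,)
d_sup = ngram_distribution(sup_table, context, VOCAB_SIZE)
d_clm = ngram_distribution(clm_table, context, VOCAB_SIZE)
d_mix = mix_targets(d_sup, d_clm, alpha=0.5)

# Loss of a model that predicts a uniform distribution over the vocabulary.
uniform_log_probs = [math.log(1.0 / VOCAB_SIZE)] * VOCAB_SIZE
loss = soft_cross_entropy(d_mix, uniform_log_probs)

print("D_mix:", d_mix)
print("loss:", loss)
```

Because the tables are built once from the training data before training starts, this kind of target construction adds no online training cost, which is consistent with the claim in the abstract; the actual fusion rule and choice of $n$ should be taken from the paper itself.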

Paper Structure (26 sections, 3 equations, 5 figures, 3 tables)

Figures (5)

  • Figure 1: (a): Changes in the sharpness of model distributions with increasing model size. (b): Changes in $\texttt{Sim}_{ngram}$/$\texttt{Sim}_{ntp}$ as calibrated performance increases.
  • Figure 2: Overall framework of NDP. The numbers in the squares represent token indices in the vocabulary. The $n$-grams required for the supervised table are counted starting from the question/input, whereas the CLM table counts only from the answer/output.
  • Figure 3: Comparison of COMET22 scores for different models on WMT22 and IWSLT2017 datasets
  • Figure 4: Dataset configuration. Numbers represent tokens, and a tilde represents a token that does not repeat with other tokens. Items in red font are those whose fitting accuracy we observe. are used to represent the common prefix of the input, represent the different suffixes of the input, the blue blocks represent the target to be predicted, and represent the irrelevant tokens. $n=40$ in our setting.
  • Figure 5: Analysis of model convergence with increasing training epochs. The left figure shows the similarity of the model's output distribution on the target items. The right figure shows the similarity with irrelevant items.