Token-Level Uncertainty-Aware Objective for Language Model Post-Training

Tingkai Liu, Ari S. Benjamin, Anthony M. Zador

TL;DR

The paper tackles post-training challenges of large language models by distinguishing token-level epistemic and aleatoric uncertainties. It introduces an uncertainty-aware training objective that restricts the likelihood loss to selected high-loss tokens (masked MLE) and regularizes via self-distillation to avoid overfitting, improving both in-distribution and out-of-distribution performance. Empirical results across Gemma, LLaMA, and Phi models on Alpaca, ShareGPT, and GSM8K demonstrate a token-level automatic curriculum and robustness benefits, with improved generalization and adaptability. The work highlights uncertainty estimation as a practical driver for training objectives, while noting computational and curriculum-design limitations and outlining future directions for joint data-model curricula.

Abstract

In this work, we connect token-level uncertainty in causal language modeling to two types of training objectives: 1) masked maximum likelihood (MLE), 2) self-distillation. We show that masked MLE is effective in reducing epistemic uncertainty and serves as an effective token-level automatic curriculum learning technique. However, masked MLE is prone to overfitting and requires self-distillation regularization to improve or maintain performance on out-of-distribution tasks. We demonstrate significant performance gains from the proposed training objective, which combines masked MLE and self-distillation, across multiple architectures (Gemma, LLaMA, Phi) and datasets (Alpaca, ShareGPT, GSM8K), mitigating overfitting while maintaining adaptability during post-training. Our findings suggest that uncertainty-aware training provides an effective mechanism for enhancing language model training.
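
The combined objective described in the abstract can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: the per-token negative log-likelihood stands in for the uncertainty metric, the top 25% of tokens by that metric are kept for the MLE term, the self-distillation term is a KL divergence to a frozen copy of the model averaged over all response tokens, and the names `top_quantile_mask`, `combined_loss`, `keep_fraction`, and `distill_weight` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's vocabulary logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_nll(logits_seq, targets):
    """Per-token negative log-likelihood of the target tokens."""
    return [-log_softmax(l)[t] for l, t in zip(logits_seq, targets)]


def top_quantile_mask(scores, keep_fraction=0.25):
    """Boolean mask selecting the keep_fraction of tokens with the highest score."""
    k = max(1, math.ceil(keep_fraction * len(scores)))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    keep = set(order[:k])
    return [i in keep for i in range(len(scores))]


def kl_divergence(teacher_logits, student_logits):
    """KL(teacher || student) for a single token position."""
    lp = log_softmax(teacher_logits)
    lq = log_softmax(student_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def combined_loss(student_logits, teacher_logits, targets,
                  keep_fraction=0.25, distill_weight=1.0):
    """Masked MLE on selected tokens plus a self-distillation penalty.

    Assumption: the teacher is a frozen copy of the model taken before
    post-training, and the penalty is averaged over all response tokens.
    """
    nll = token_nll(student_logits, targets)
    mask = top_quantile_mask(nll, keep_fraction)
    masked_mle = sum(l for l, m in zip(nll, mask) if m) / sum(mask)
    distill = sum(
        kl_divergence(t, s) for t, s in zip(teacher_logits, student_logits)
    ) / len(student_logits)
    return masked_mle + distill_weight * distill


if __name__ == "__main__":
    # Four response tokens over a two-word vocabulary (toy values).
    student = [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0], [3.0, 0.0]]
    teacher = [[1.5, 0.0], [0.5, 1.0], [0.0, 0.0], [2.0, 0.0]]
    print(combined_loss(student, teacher, targets=[0, 0, 1, 0]))
```

In this sketch the masked term pushes the model toward the targets only on the tokens it currently predicts worst, while the distillation term penalizes drift from the frozen model's predictive distribution. That mirrors the trade-off the abstract describes between adapting to the new data and maintaining out-of-distribution performance.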

Paper Structure

This paper contains 23 sections, 9 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: The proposed training procedure combines maximum likelihood with a self-distillation objective to improve both in-distribution and out-of-distribution performance.
  • Figure 2: (A) Token-level uncertainties, predictive loss, and entropy for Gemma-2B-it. Note that only tokens in the completion are color coded by the various uncertainty and loss metrics. (B) Correlation between uncertainty (epistemic/aleatoric) and model metrics (predictive loss/entropy) in language modeling of the Alpaca dataset across models. (C) Effect of training on different data subsets whose degree of aleatoric/epistemic uncertainty varies with distance from the solution.
  • Figure 3: In-distribution performance gain of token-level masked MLE compared to baseline (vanilla MLE) and document-level masked MLE. Models were trained on (top) Alpaca and (bottom) GSM8K via the masked MLE objective on tokens with metric values in the top 25% quantile.
  • Figure 4: Example of automatic curriculum learning as a result of training on high epistemic uncertainty (color-coded) tokens. Note that while both prompt and response tokens are color coded, losses are only computed and propagated for response tokens during training.
  • Figure 5: Downstream performance of (top) Llama-3.2-1B and (bottom) Gemma-2B trained on Alpaca dataset with different training objectives.
  • ...and 7 more figures