Token-Level Uncertainty-Aware Objective for Language Model Post-Training
Tingkai Liu, Ari S. Benjamin, Anthony M. Zador
TL;DR
The paper addresses post-training of large language models by distinguishing token-level epistemic and aleatoric uncertainty. It introduces an uncertainty-aware training objective that masks high-loss tokens (masked MLE) and regularizes with self-distillation to avoid overfitting, improving both in-distribution and out-of-distribution performance. Experiments with Gemma, LLaMA, and Phi models on Alpaca, ShareGPT, and GSM8K show that masked MLE acts as a token-level automatic curriculum and that the combined objective improves generalization while preserving adaptability. The work positions uncertainty estimation as a practical driver for training objectives, notes computational and curriculum-design limitations, and outlines future directions for joint data-model curricula.
Abstract
In this work, we connect token-level uncertainty in causal language modeling to two types of training objectives: 1) masked maximum likelihood estimation (MLE) and 2) self-distillation. We show that masked MLE is effective in reducing epistemic uncertainty and serves as an effective token-level automatic curriculum learning technique. However, masked MLE is prone to overfitting and requires self-distillation regularization to improve or maintain performance on out-of-distribution tasks. We demonstrate significant performance gains from the proposed training objective, which combines masked MLE and self-distillation, across multiple architectures (Gemma, LLaMA, Phi) and datasets (Alpaca, ShareGPT, GSM8K), mitigating overfitting while maintaining adaptability during post-training. Our findings suggest that uncertainty-aware training provides an effective mechanism for enhancing language model training.
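
To make the shape of the combined objective concrete, the following is a minimal sketch in plain Python and not the authors' implementation. It assumes that masking drops tokens whose per-token negative log-likelihood exceeds a fixed threshold, and that self-distillation is a KL penalty toward a frozen copy of the model taken before post-training. The abstract specifies neither choice. The names `uncertainty_aware_loss`, `loss_threshold`, and `distill_weight` are hypothetical.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def uncertainty_aware_loss(student_logits, teacher_logits, targets,
                           loss_threshold=2.0, distill_weight=0.5):
    """Masked MLE plus a self-distillation (KL) regularizer.

    student_logits, teacher_logits: one list of vocabulary logits per token.
    targets: one target token id per token.
    loss_threshold, distill_weight: hypothetical hyperparameters (assumed,
    not taken from the paper).
    Returns (total_loss, number_of_tokens_kept_by_the_mask).
    """
    masked_nll, kept, kl_total = 0.0, 0, 0.0
    for s, t, y in zip(student_logits, teacher_logits, targets):
        s_logp = log_softmax(s)
        t_logp = log_softmax(t)
        nll = -s_logp[y]
        # Assumed masking rule: skip tokens whose loss is above the threshold.
        if nll <= loss_threshold:
            masked_nll += nll
            kept += 1
        # KL(teacher || student), computed on every token, masked or not.
        kl_total += sum(math.exp(tl) * (tl - sl)
                        for tl, sl in zip(t_logp, s_logp))
    mle_term = masked_nll / kept if kept else 0.0
    kl_term = kl_total / len(targets) if targets else 0.0
    return mle_term + distill_weight * kl_term, kept
```

In this sketch the KL term is applied to all tokens, so the regularizer still constrains the model on tokens that the mask excludes from the likelihood term. Whether the paper applies distillation to all tokens or only to a subset is not stated in the abstract.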
