Language Model Probabilities are Not Calibrated in Numeric Contexts

Charles Lovering, Michael Krumdick, Viet Dac Lai, Seth Ebner, Nilesh Kumar, Varshini Reddy, Rik Koncel-Kedziorski, Chris Tanner

TL;DR

This work demonstrates that state-of-the-art language models fail to calibrate their next-token probabilities to the numeric content embedded in textual contexts, even in simple two-option scenarios. By formalizing calibration with a context-defined distribution $P$ and the model output distribution $\Pi$, and evaluating across colors, wordproblems, and distributions with PM, WD, and RE metrics, the study reveals pervasive miscalibration and systematic biases. Instruction tuning tends to reduce entropy and induce mode collapse, while baseline strategies that overweight the higher-numeric option often outperform the models. The findings highlight significant practical risks for probabilistic reasoning tasks and call for targeted methods to align LM outputs with context-driven numeric likelihoods. Overall, the paper provides a rigorous, quantitative portrait of calibration gaps and the persistent influence of identity, order, and frequency effects on LM probabilities.

Abstract

Some statements have one well-defined continuation (e.g., "the Eiffel Tower is in [Paris]"), whereas others have a natural distribution over multiple options (e.g., "the weighted coin flip was [Heads/Tails].") We argue that language model (LM) outputs should capture these natural distributions. Our work specifically tests whether LM output probabilities are calibrated to numeric information within their textual contexts. For example, if the context (the prompt) concerns two equally likely options (e.g., heads or tails for a fair coin), the LM output probabilities should also be equal. Likewise, in a context with nonuniformly likely events (e.g., rolling a pair with two dice) an LM should output proportionate probabilities. However, we find that even in simple settings, the best LMs (1) are poorly calibrated and (2) have systematic biases: artifacts like word identity, word order, and word frequency all impact calibration. For example, gpt-4o-mini often picks the first of two options presented in the prompt regardless of the options' implied likelihoods, whereas Llama-3.1-8B picks the second. Models do not allocate probability mass among valid options in a calibrated manner.
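
To make the notion of calibration above concrete, the following is a minimal sketch, not the authors' implementation. It assumes a two-option prompt whose item counts define the context distribution, and it compares that distribution with model probabilities using relative entropy (KL divergence) as one illustrative discrepancy measure. The counts, the second color, and all function names are hypothetical; only the 50.2% and 99.7% figures echo the Figure 1 caption.

```python
import math


def context_distribution(counts):
    """Normalize the item counts stated in a prompt into probabilities."""
    total = sum(counts.values())
    return {option: n / total for option, n in counts.items()}


def restrict_to_options(model_probs, options):
    """Keep only the valid options and renormalize.

    Returns the renormalized distribution and the raw probability mass
    the model placed on the valid options before renormalizing.
    """
    options = list(options)
    mass = sum(model_probs.get(o, 0.0) for o in options)
    renormalized = {o: model_probs.get(o, 0.0) / mass for o in options}
    return renormalized, mass


def relative_entropy(p, q, eps=1e-12):
    """KL(p || q) in nats; zero when the two distributions agree."""
    return sum(
        p[o] * math.log(p[o] / max(q[o], eps)) for o in p if p[o] > 0
    )


# Hypothetical counts chosen so that "red" has a 50.2% share.
counts = {"red": 251, "blue": 249}
p = context_distribution(counts)

# Hypothetical next-token probabilities with 99.7% on "red".
model_probs = {"red": 0.997, "blue": 0.003}
pi, mass = restrict_to_options(model_probs, p.keys())

print(f"context P: {p}")
print(f"model Pi: {pi} (mass on valid options: {mass:.3f})")
print(f"relative entropy: {relative_entropy(p, pi):.3f} nats")

# The dice example from the abstract: 6 of the 36 equally likely
# outcomes of two dice are pairs, so a calibrated model would put
# about 1/6 of its probability on "pair".
pair_probability = 6 / 36
```

A calibrated model would yield a relative entropy near zero here; a model that concentrates almost all of its probability on the slightly more numerous option yields a large value.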

Paper Structure

This paper contains 45 sections, 1 equation, 24 figures, 14 tables.

Figures (24)

  • Figure 1: Models are un-calibrated. In this example, gpt-4o over-weights the option with a higher count of items beyond the calibrated probability, predicting red with 99.7% probability when 50.2% is appropriate. We find consistent patterns of uncalibrated behavior.
  • Figure 2: Systematic Patterns in Model Behavior. Each cell corresponds to model behavior across 100 examples. The number is the rate at which the outputs are compatible with the given behavior. For example, 1.00 in the top-left cell means that for 100/100 instances a majority of the probability mass is on the first option. The top-left cell corresponds to instances where purple is first in the prompt and white is second. High rates across multiple behaviors are impossible; they are mutually exclusive across the 100 instances. See Figure 4 for a representative instance drawn from the top-left cell.
  • Figure 3: Systematic Patterns in Model Behavior for the colors dataset. Each bar shows the percent of the different behaviors models exhibit averaged across templates, color pairs, and numeric scales.
  • Figure 4: Representative Examples of Model Behaviors. This instance is from the gpt-4o-mini results in Figure 2 and shows the Pick First behavior from the top-left cell.
  • Figure 5: Models Over-represent Some Numbers; the modes are heavily over-represented. Each bar is the mode frequency, i.e., how often the top-chosen token is chosen averaged over distributions. The black lines mark the expected rate for a calibrated model.
  • ...and 19 more figures
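
The per-cell rates described in the Figure 2 caption can be illustrated with a short sketch. It is an assumption-laden reading of the caption, not the paper's code: an instance counts toward "pick first" when the first-listed option receives the majority of the probability mass, and the "pick higher" label and the tuple layout are hypothetical.

```python
def behavior_rates(instances):
    """Compute how often model outputs are compatible with each behavior.

    Each instance is a tuple
    (prob_first, prob_second, count_first, count_second), where the
    probabilities are the model's mass on the first- and second-listed
    options and the counts are the numbers stated in the prompt.
    """
    n = len(instances)
    pick_first = sum(pf > ps for pf, ps, _, _ in instances)
    pick_second = sum(ps > pf for pf, ps, _, _ in instances)
    # "Pick higher": the majority option is the one with the larger count.
    pick_higher = sum(
        (pf > ps and cf > cs) or (ps > pf and cs > cf)
        for pf, ps, cf, cs in instances
    )
    return {
        "pick_first": pick_first / n,
        "pick_second": pick_second / n,
        "pick_higher": pick_higher / n,
    }


# Three made-up instances, for illustration only.
example = [
    (0.9, 0.1, 10, 20),  # majority on first, but second has the larger count
    (0.8, 0.2, 30, 5),   # majority on first, first has the larger count
    (0.3, 0.7, 4, 9),    # majority on second, second has the larger count
]
rates = behavior_rates(example)
print(rates)
```

In this sketch "pick first" and "pick second" can never both apply to the same instance, so their rates sum to at most 1, which matches the caption's remark that high rates across mutually exclusive behaviors are impossible.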