Trained on Tokens, Calibrated on Concepts: The Emergence of Semantic Calibration in LLMs

Preetum Nakkiran, Arwen Bradley, Adam Goliński, Eugene Ndiaye, Michael Kirchhof, Sinead Williamson

TL;DR

This work investigates the emergence of semantic calibration in large language models (LLMs) by introducing a formal framework based on semantic collapsing functions $B$ and the induced category distributions $\pi_x = B_x \sharp p_x$. The authors prove that $B$-confidence-calibration is equivalent to local loss optimality with respect to a class of semantic perturbations $\mathcal{W}_B$, suggesting that calibration arises as a byproduct of next-token prediction if the model can anticipate its own semantic distribution early in generation. They further show that these perturbations are efficiently implementable via simple circuits when the model has access to intermediate $B$-confidences. This leads to testable predictions: base LLMs should exhibit semantic calibration across open-domain QA, whereas RL instruction-tuning and chain-of-thought can degrade calibration. Through experiments on GSM8K, OpenMathInstruct-2, TriviaQA, and SimpleQA with models from 0.5B to 72B, they demonstrate that base models are indeed semantically calibrated in non-CoT settings, while instruction-tuned or CoT configurations often break calibration, with a measurable correlation between the learnability of $B_x \sharp p_x$ (via LoRA probes) and calibration performance. The results provide a principled explanation for when and why semantic calibration emerges, with implications for LLM uncertainty estimation, training regimes, and prompting strategies. The work notes that its conclusions are limited to the specific calibration notion studied and calls for future work on broader forms of calibration and practical deployment considerations.
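
The pushforward notation above can be unpacked as follows. This is a sketch that assumes $p_x$ denotes the model's distribution over generated strings for a prompt $x$ and that $B_x$ maps a generated string to its semantic class; the symbols $z$ (a generated string) and $c$ (a semantic class) are introduced here for illustration only.

$$
\pi_x(c) \;=\; (B_x \sharp p_x)(c) \;=\; \Pr_{z \sim p_x}\big[\, B_x(z) = c \,\big]
$$

Under this reading, the semantic prediction for $x$ is $\arg\max_c \pi_x(c)$ and the semantic confidence is $\max_c \pi_x(c)$, matching the description in Figure 1.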

Abstract

Large Language Models (LLMs) often lack meaningful confidence estimates for their outputs. While base LLMs are known to exhibit next-token calibration, it remains unclear whether they can assess confidence in the actual meaning of their responses beyond the token level. We find that, when using a certain sampling-based notion of semantic calibration, base LLMs are remarkably well-calibrated: they can meaningfully assess confidence in open-domain question-answering tasks, despite not being explicitly trained to do so. Our main theoretical contribution establishes a mechanism for why semantic calibration emerges as a byproduct of next-token prediction, leveraging a recent connection between calibration and local loss optimality. The theory relies on a general definition of "B-calibration," which is a notion of calibration parameterized by a choice of equivalence classes (semantic or otherwise). This theoretical mechanism leads to a testable prediction: base LLMs will be semantically calibrated when they can easily predict their own distribution over semantic answer classes before generating a response. We state three implications of this prediction, which we validate through experiments: (1) Base LLMs are semantically calibrated across question-answering tasks, (2) RL instruction-tuning systematically breaks this calibration, and (3) chain-of-thought reasoning breaks calibration. To our knowledge, our work provides the first principled explanation of when and why semantic calibration emerges in LLMs.

Paper Structure

This paper contains 61 sections, 12 theorems, 84 equations, 47 figures, 2 tables.

Key Result

Theorem 1

For all models $p_\theta$, collapsing functions $B$ and distributions $\mathcal{D}$, the following are equivalent:

Figures (47)

  • Figure 1: Semantic calibration refers to calibration of an LLM-induced semantic classifier (dashed box): the classifier induced by post-processing LLM outputs with a given semantic collapsing function, which we refer to as $B$ throughout. To measure semantic confidence calibration: for a given question, sample multiple temperature $T\!=\!1$ generations, and extract semantic answers by applying the collapsing function $B$ (e.g. a strong LLM prompted to extract one-word answers). This yields an empirical distribution over semantic classes (above: Paris, Rome, Berlin), which we treat as the classifier output. This classifier output defines a semantic prediction (=argmax probability) and a semantic confidence (=max probability). Semantic confidence calibration means, over all questions, these predictions are confidence-calibrated in the standard classification sense.
  • Figure 2: Semantic Calibration of LLMs. Overlaid reliability diagrams evaluating semantic calibration of Qwen, Gemini, Mistral, and Llama-family models of sizes from 0.5B to 70B, on four datasets. Each model is prompted to respond in one of three different styles: a single word ("concise"), a complete sentence ("sentence"), or using chain-of-thought ("CoT"). This yields 6 color-coded configurations for each model: (model-variant, response-style) $\in \{\text{Base}, \text{Instruct}\} \times \{\text{Concise}, \text{Sentence}, \text{CoT}\}$. We group these configurations into two rows based on our theoretical predictions. First row (predicted calibrated): Reliability diagrams of all configurations predicted to be confidence-calibrated according to our theory: base models with concise or sentence response types. Second row (not predicted calibrated): Configurations which need not be calibrated according to our theory: post-trained instruct models with any response type (concise, sentence, or chain-of-thought), and base models with chain-of-thought. Third row: Box plots summarizing the distribution of calibration errors for each of the 6 configurations. Only the first two configurations (base-concise and base-sentence) are reliably well-calibrated, as predicted by our theory. Individual reliability diagrams for all experiments are in the appendix.
  • Figure 3: Reliability diagrams demonstrating semantic confidence-calibration of base (pretrained-only) LLMs across various combinations of datasets, models, and prompts. Calibration error is measured with SmoothECE (smECE), average confidence and accuracy are marked with a black cross, and the density of semantic confidences is shown as a gray histogram; details are in the appendix.
  • Figure 4: Conjectured Mechanism for Semantic Calibration. Implications have varying levels of support: the solid blue arrow has a formal proof; the dashed blue arrows have proofs of "morally similar" (but weaker) implications. The paper's main heuristic claim encompasses the full chain of implications, and has experimental support.
  • Figure 5:
  • ...and 42 more figures
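
The measurement procedure described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function names, the toy collapsing function `toy_collapse`, and the example generations are all hypothetical, and the calibration metric here is a plain equal-width binned ECE rather than the SmoothECE (smECE) used in the paper.

```python
from collections import Counter
from typing import Callable, Sequence


def semantic_confidence(
    generations: Sequence[str], collapse: Callable[[str], str]
) -> tuple[str, float]:
    """Return (semantic prediction, semantic confidence) for one question.

    `generations` are sampled responses (temperature 1 in the paper's setup);
    `collapse` plays the role of the collapsing function B.
    """
    classes = [collapse(g) for g in generations]
    counts = Counter(classes)
    # argmax class is the prediction; its empirical frequency is the confidence
    prediction, top_count = counts.most_common(1)[0]
    return prediction, top_count / len(classes)


def binned_ece(
    confidences: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Equal-width binned expected calibration error (a stand-in for smECE)."""
    n = len(confidences)
    bins: list[list[tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


def toy_collapse(response: str) -> str:
    """Hypothetical collapsing function: last word, lowercased, no period."""
    return response.strip().rstrip(".").split()[-1].lower()


samples = ["Paris", "It is Paris.", "Rome", "paris"]
print(semantic_confidence(samples, toy_collapse))  # ('paris', 0.75)
```

In the paper's setup the collapsing function is a strong LLM prompted to extract one-word answers; the string heuristic above only stands in for it so that the example runs without external dependencies. Calibration is then assessed over many questions by passing each question's semantic confidence, together with whether its semantic prediction was correct, to a calibration metric such as `binned_ece`.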

Theorems & Definitions (54)

  • Definition 1: Confidence-calibration
  • Definition 2: $B$-confidence-calibration
  • Definition 3: Perturbation operator
  • Definition 4: Perturbed model
  • Definition 5: $\mathcal{W}$-local loss optimality
  • Definition 6: Semantic Perturbation Function Classes
  • Theorem 1: Equivalence of Calibration and Local Loss Optimality
  • Remark 1
  • Definition 7: Intermediate $B$-Confidences
  • Theorem 2
  • ...and 44 more