Linguistic Calibration of Long-Form Generations

Neil Band; Xuechen Li; Tengyu Ma; Tatsunori Hashimoto

TL;DR

This work tackles confident hallucinations in long-form LM outputs by introducing linguistic calibration (LC), which aligns long-form generations with calibrated downstream forecasts. It defines LC via a decision-theoretic lens, linking a downstream reader's forecasts to the true outcome distribution $p(y|x)$ and optimizing with a strictly proper scoring rule; the authors implement a two-stage training pipeline (supervised finetuning to obtain $\pi_{SFT}$ followed by reinforcement learning to obtain $\pi_{LC}$) using a surrogate reader. Empirically, LC applied to Llama 2 7B achieves significantly better calibration (lower reader ECE) than strong factuality baselines while preserving accuracy, and generalizes across domain shifts to scientific QA and biography generation without task-specific retraining. The paper also provides theoretical connections showing that linguistic calibration implies no-regret and accurate loss estimation guarantees for downstream decision-making, supporting the practical value of calibrating long-form text end-to-end. Overall, LC offers a principled path to safer, more interpretable long-form LMs in real-world decision contexts by calibrating the space of user predictions derived from model-produced text.
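
As a rough sketch of the objective described above (not a verbatim copy of the paper's equations), the reinforcement learning stage can be read as maximizing the expected score of the reader's forecast. The symbols $q$ (the open-ended prompt), $z$ (the long-form generation), and $R$ (the scoring rule) are notation assumed here for illustration; $f$ is the reader and $p(y|x)$ is the true distribution over answers $y$ to a related question $x$:

$$\max_{\pi} \; \mathbb{E}_{q,\; z \sim \pi(\cdot \mid q),\; (x, y)} \Big[ R\big(f(x, z),\, y\big) \Big]$$

A scoring rule $R$ is strictly proper when the true distribution is the unique maximizer of the expected score, that is, for every forecast $\hat{p} \neq p$:

$$\mathbb{E}_{y \sim p}\big[R(p, y)\big] \;>\; \mathbb{E}_{y \sim p}\big[R(\hat{p}, y)\big]$$

The logarithmic score $R(\hat{p}, y) = \log \hat{p}(y)$ is one standard example. Under such a rule, a generation $z$ earns the highest expected reward when it leads the reader to forecast $p(y|x)$ itself.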

Abstract

Language models (LMs) may lead their users to make suboptimal downstream decisions when they confidently hallucinate. This issue can be mitigated by having the LM verbally convey the probability that its claims are correct, but existing models cannot produce long-form text with calibrated confidence statements. Through the lens of decision-making, we define linguistic calibration for long-form generations: an LM is linguistically calibrated if its generations enable its users to make calibrated probabilistic predictions. This definition enables a training framework where a supervised finetuning step bootstraps an LM to emit long-form generations with confidence statements such as "I estimate a 30% chance of..." or "I am certain that...", followed by a reinforcement learning step which rewards generations that enable a user to provide calibrated answers to related questions. We linguistically calibrate Llama 2 7B and find in automated and human evaluations of long-form generations that it is significantly more calibrated than strong finetuned factuality baselines with comparable accuracy. These findings generalize under significant domain shifts to scientific and biomedical questions and to an entirely held-out person biography generation task. Our results demonstrate that long-form generations may be calibrated end-to-end by constructing an objective in the space of the predictions that users make in downstream decision-making.
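
To make the reward in the reinforcement learning step concrete, here is a minimal sketch that scores a reader's forecast with the logarithmic scoring rule, one common strictly proper rule. The function names, the clipping constant `eps`, and the toy distributions are assumptions for illustration and are not taken from the paper.

```python
import math


def log_score(forecast, outcome, eps=1e-12):
    """Log score of a forecast (dict: answer -> probability) at the realized outcome.

    eps is an assumed clipping constant that avoids log(0) for answers
    the forecast rules out entirely.
    """
    return math.log(max(forecast.get(outcome, 0.0), eps))


def expected_score(forecast, true_dist):
    """Expected log score of a forecast when outcomes are drawn from true_dist."""
    return sum(p * log_score(forecast, y) for y, p in true_dist.items() if p > 0)


# Toy question with two plausible answers; the true answer distribution is 70/30.
true_dist = {"A": 0.7, "B": 0.3}

# Forecasts a reader might form after reading three different generations.
calibrated = {"A": 0.7, "B": 0.3}        # generation hedges appropriately
overconfident = {"A": 0.99, "B": 0.01}   # generation asserts A almost surely
underconfident = {"A": 0.5, "B": 0.5}    # generation is needlessly vague

scores = {
    "calibrated": expected_score(calibrated, true_dist),
    "overconfident": expected_score(overconfident, true_dist),
    "underconfident": expected_score(underconfident, true_dist),
}
best = max(scores, key=scores.get)
```

In this toy setup the calibrated forecast attains the highest expected score, which is the property that lets a reward of this form push generations toward confidence statements a reader can trust.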

Paper Structure (96 sections, 5 theorems, 27 equations, 17 figures, 7 tables, 1 algorithm)

Introduction
Our contributions.
Setup
Linguistic Calibration of Long-Form Generations
LM-assisted user forecasting.
Defining linguistic calibration for long-form generations.
Examples of linguistic $\phi$-calibration.
From Calibration to Optimal Decisions
LM-assisted user decision-making.
Linguistic calibration implies informed decision-making.
Guarantees for weaker notions of calibration.
Training Objective for Linguistic Calibration
Proper scoring rules.
Linguistic calibration objective.
Method
...and 81 more sections

Key Result

Lemma 1

If a reader $f: \mathcal{X} \times \mathcal{Z} \rightarrow \Delta^{|\mathcal{Y}|}$ is $\mathcal{L}$-decision calibrated, then it satisfies:

Figures (17)

Figure 1: Illustrative example of linguistic calibration. We define linguistic calibration of long-form generations (LC) as calibrating an LM's generations in a way that leads to calibrated downstream user forecasts. We apply LC to train an LM that emits calibrated statements of confidence in natural language, enabling better downstream decisions. Left: users read long-form generations (e.g., a doctor reading an LM-generated clinical note). Middle: to decide the patient's treatment, the doctor first forecasts the patient's underlying condition. Upper Right: when standard LMs lack knowledge, they hallucinate confidently, leading to a suboptimal decision (treating the wrong condition). Lower Right: even if the base LM cannot be confidently correct, linguistic calibration encourages the LM to spread probability over plausible claims, enabling a better decision.
Figure 2: Our training framework for linguistic calibration of long-form generations (LC) calibrates the long-form generations of an LM by calibrating downstream user forecasts. It involves two steps: summary distillation (Upper) and decision-based RL (Lower). Datasets are in white, LMs in blue, and steps involving user or surrogate forecasts are in green.
Figure 3: Accuracy-ECE Frontier for Question-Answering (upper left is better). LC RL Pareto-dominates Factuality RL and SFT, with significantly better reader ECE while matching or exceeding their accuracy.
Figure 4: TriviaQA Reliability Diagrams. LC models display a wide range of confidences and good calibration in their long-form generations, with LC RL improving calibration further. Human and simulated results closely match.
Figure 5: Qualitative example from Factuality and LC RL when evaluated under task distribution shift on biography generation. LC RL produces numerical and linguistic confidence statements throughout the paragraph, highlighted in blue. False statements are highlighted in red. We include additional examples in the appendix.
...and 12 more figures
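
The reader ECE and reliability diagrams referenced in the captions above rest on binning forecasts by confidence and comparing each bin's average confidence with its accuracy. A minimal sketch of the standard equal-width binned estimator follows; the bin count and the function name are assumptions, and the paper's exact evaluation protocol may differ.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Equal-width binned ECE.

    confidences: probabilities in [0, 1] assigned to the reader's chosen answers.
    correct: matching 0/1 (or bool) flags for whether each answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    # Group (confidence, correctness) pairs into equal-width confidence bins.
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * num_bins), num_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, ok))

    # Weighted average of |mean confidence - accuracy| over non-empty bins.
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


# Ten forecasts at 90% confidence with nine correct: close to perfectly calibrated.
well_calibrated = expected_calibration_error([0.9] * 10, [1] * 9 + [0])

# Four forecasts at 100% confidence with only two correct: a gap of 0.5.
overconfident = expected_calibration_error([1.0] * 4, [1, 1, 0, 0])
```

Each non-empty bin corresponds to one point on a reliability diagram (mean confidence against accuracy), so a lower ECE means the points sit closer to the diagonal.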

Theorems & Definitions (11)

Definition 2.1: Linguistic Calibration of Long-Form Generations
Definition B.2: Decision Calibration (Definition 2 in Zhao et al., 2021)
Definition B.3: Decision Calibration with LM Assistance
Lemma 1: instantiation of Proposition 1 in Zhao et al. (2021)
Theorem B.4: Linguistic $\phi$-calibration implies no regret and accurate loss estimation guarantees
proof
Definition C.1: based on Gneiting and Raftery (2007)
Lemma 2: e.g., Gneiting and Raftery (2007)
Lemma 3: e.g., p. 29 of Cover and Thomas (1991)
Theorem C.2
...and 1 more
