Calibrating Language Models with Adaptive Temperature Scaling
Johnathan Xie, Annie S. Chen, Yoonho Lee, Eric Mitchell, Chelsea Finn
TL;DR
Large language models become less well calibrated after RLHF fine-tuning. The paper addresses this with Adaptive Temperature Scaling (ATS), a post-hoc method that predicts a per-token temperature from hidden-state features and calibrates token probabilities without altering their ranking. A lightweight calibration head produces the temperature, which is applied multiplicatively to the logits, and the head is trained with a selective-smoothing loss that treats correct and incorrect predictions differently. Across MMLU, TriviaQA, and TruthfulQA, ATS improves calibration metrics (ECE and Brier score) by 10-50% on post-RLHF models such as Llama-2-7b-Chat and Qwen-7b-Chat, with minimal or no impact on downstream performance. The calibration head is fit on the Alpaca GPT-4 dataset, and the results indicate that per-token, context-aware calibration generalizes across models and tasks and can yield more reliable confidence estimates in real-world deployments.
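To make the mechanism concrete, the following is a minimal sketch of per-token temperature scaling for a single token position. It is not the authors' implementation. The linear head (`predict_log_scale`), its weights, and the use of an exponential to keep the scale positive are all assumptions made for illustration. The sketch only shows why multiplying every logit at a position by one positive scalar changes the confidence while leaving the ranking of tokens unchanged.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_log_scale(hidden, weights, bias):
    """Hypothetical calibration head: a single linear layer on the hidden state.

    The head architecture here is an assumption for illustration only.
    """
    return sum(h * w for h, w in zip(hidden, weights)) + bias


def calibrate(logits, hidden, weights, bias):
    """Scale all logits at one position by a single positive scalar."""
    # exp() keeps the multiplier positive, so the order of the logits
    # (and therefore the argmax) cannot change. This choice is an assumption.
    scale = math.exp(predict_log_scale(hidden, weights, bias))
    return softmax([scale * z for z in logits])


def ranking(values):
    """Indices sorted from highest to lowest value."""
    return sorted(range(len(values)), key=lambda i: values[i], reverse=True)


# Toy example: three-token vocabulary, two-dimensional hidden state.
logits = [4.0, 2.0, 1.0]
hidden = [0.5, -1.0]
weights = [0.2, 0.9]   # made-up head parameters
bias = 0.0

raw = softmax(logits)
cal = calibrate(logits, hidden, weights, bias)

print("raw confidence:", round(max(raw), 3))
print("calibrated confidence:", round(max(cal), 3))
print("ranking preserved:", ranking(raw) == ranking(cal))
```

With these made-up parameters the predicted scale is below 1, so the distribution is softened and the top-token confidence drops. A scale above 1 would sharpen the distribution instead. Because the scale depends on the hidden state, different positions can be adjusted by different amounts.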
Abstract
The effectiveness of large language models (LLMs) is measured not only by their ability to generate accurate outputs but also by their calibration, that is, how well their confidence scores reflect the probability that their outputs are correct. While unsupervised pre-training has been shown to yield LLMs with well-calibrated conditional probabilities, recent studies have shown that calibration degrades significantly after fine-tuning with reinforcement learning from human feedback (RLHF). In this work, we introduce Adaptive Temperature Scaling (ATS), a post-hoc calibration method that predicts a temperature scaling parameter for each token prediction. The predicted temperature values adapt based on token-level features and are fit over a standard supervised fine-tuning (SFT) dataset. The adaptive nature of ATS addresses the varying degrees of calibration shift that can occur after RLHF fine-tuning. ATS improves calibration by 10-50% across three downstream natural language evaluation benchmarks compared to prior calibration methods and does not impede performance improvements from RLHF.
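The notion of calibration used above can be illustrated with the two metrics named in the TL;DR, expected calibration error (ECE) and Brier score. The sketch below uses the common equal-width binned form of ECE and a Brier score computed on the top-prediction confidence. The bin count, the binning scheme, and the top-label form of the Brier score are assumptions here, since the text does not state the exact definitions used in the evaluation.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: weighted average gap between confidence and accuracy.

    confidences: predicted probability of the chosen answer, in [0, 1].
    correct: booleans, True where the chosen answer was right.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


def brier_score(confidences, correct):
    """Mean squared error between confidence and the 0/1 correctness label."""
    total = sum((c - (1.0 if ok else 0.0)) ** 2
                for c, ok in zip(confidences, correct))
    return total / len(confidences)


# An overconfident toy model: 90% confidence, but only half the answers are right.
overconfident = [0.9, 0.9, 0.9, 0.9]
outcomes = [True, True, False, False]

print("ECE:", round(expected_calibration_error(overconfident, outcomes), 3))
print("Brier:", round(brier_score(overconfident, outcomes), 3))
```

Lower is better for both metrics. A model that reports 0.5 confidence on the same outcomes would have an ECE of zero, which is the sense in which its confidence "reflects the probability" of being correct.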
