Few-Shot Recalibration of Language Models

Xiang Lisa Li, Urvashi Khandelwal, Kelvin Guu

TL;DR

A new framework for few-shot slice-specific recalibration: given a few unlabeled examples from any slice, the model predicts a curve that remaps confidence scores to be more accurate for that slice. It can recalibrate arbitrary new slices without using any labeled data from them.

Abstract

Recent work has uncovered promising ways to extract well-calibrated confidence estimates from language models (LMs), where the model's confidence score reflects how likely it is to be correct. However, while LMs may appear well-calibrated over broad distributions, this often hides significant miscalibration within narrower slices (e.g., systemic over-confidence in math can balance out systemic under-confidence in history, yielding perfect calibration in aggregate). To attain well-calibrated confidence estimates for any slice of a distribution, we propose a new framework for few-shot slice-specific recalibration. Specifically, we train a recalibration model that takes in a few unlabeled examples from any given slice and predicts a curve that remaps confidence scores to be more accurate for that slice. Our trained model can recalibrate for arbitrary new slices, without using any labeled data from that slice. This enables us to identify domain-specific confidence thresholds above which the LM's predictions can be trusted, and below which it should abstain. Experiments show that our few-shot recalibrator consistently outperforms existing calibration methods, for instance improving calibration error for PaLM2-Large on MMLU by 16%, as compared to temperature scaling.
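To make the aggregate-versus-slice effect concrete, the following is a minimal sketch that computes expected calibration error (ECE) on toy data. The data, the function name `expected_calibration_error`, and the choice of 10 equal-width bins are all illustrative assumptions, not the paper's evaluation code. The toy "math" slice is over-confident and the toy "history" slice is under-confident, mirroring the example in the abstract.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width binned ECE: weighted mean of |accuracy - mean confidence| per bin.

    The binning scheme is an assumption made for this illustration.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, is_correct in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 falls in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, is_correct))

    total = len(confidences)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(y for _, y in items) / len(items)
        ece += (len(items) / total) * abs(accuracy - avg_conf)
    return ece


# Hypothetical toy data: every answer is reported with confidence 0.75.
# "math" is over-confident (only 50% correct);
# "history" is under-confident (100% correct).
math_conf, math_correct = [0.75] * 100, [1] * 50 + [0] * 50
hist_conf, hist_correct = [0.75] * 100, [1] * 100

ece_math = expected_calibration_error(math_conf, math_correct)
ece_hist = expected_calibration_error(hist_conf, hist_correct)
ece_all = expected_calibration_error(
    math_conf + hist_conf, math_correct + hist_correct
)

print(ece_math, ece_hist, ece_all)
```

In this toy setup each slice has an ECE of 0.25, while the combined set has an ECE of 0 because the two errors cancel: the pooled accuracy (150 of 200) matches the reported confidence of 0.75.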

Paper Structure (34 sections, 2 equations, 5 figures, 10 tables, 1 algorithm)


Figures (5)

  • Figure 1: An example of the illusion of LM calibration. For a combination of five domains, the model is well-calibrated with a calibration error of 0.02 (the first plot). However, the same model is miscalibrated on the five individual domains, each with a higher calibration error.
  • Figure 2: A histogram of ECE scores for LLaMA-65B on 57 MMLU domains. The red line shows ECE for all the domains combined. The aggregate ECE is lower than that of most individual domains, hiding the underlying miscalibration problem.
  • Figure 3: An illustration of the few-shot recalibrator. This model learns to predict the precision curve for slices (e.g. psychology only, or 20% psychology-80% biology) of a broader distribution (mix of psychology, biology, botany etc.), using few-shot unlabeled examples. At test time, it can predict the precision curve for an unseen slice (e.g. 66% botany-34% biology) given only an unlabeled few-shot set drawn from it. This precision curve can then be used to accomplish various downstream goals.
  • Figure 4: Our approach works well even with small few-shot sets.
  • Figure 5: Examples of precision curves generated by the few-shot recalibrator, compared to the Empirical and Oracle curves. Our curves approximate the Oracle curves more closely.
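
As a rough illustration of how a precision curve can be turned into an abstention threshold, the sketch below computes a precision curve directly from labeled toy data and picks the lowest confidence threshold that meets a target precision. This is not the few-shot recalibrator itself, which predicts the curve from unlabeled examples. The function names, the threshold grid, and the data are hypothetical.

```python
def precision_curve(confidences, correct, thresholds):
    """For each threshold t, the accuracy among examples with confidence >= t."""
    curve = []
    for t in thresholds:
        kept = [y for c, y in zip(confidences, correct) if c >= t]
        # If no example clears the threshold, precision is undefined: store None.
        curve.append(sum(kept) / len(kept) if kept else None)
    return curve


def min_threshold_for_precision(thresholds, curve, target):
    """Smallest threshold whose precision meets the target, or None.

    Assumes `thresholds` is sorted in ascending order.
    """
    for t, p in zip(thresholds, curve):
        if p is not None and p >= target:
            return t
    return None


# Hypothetical labeled toy data for a single slice.
thresholds = [0.0, 0.2, 0.4, 0.6, 0.8]
confidences = [0.1, 0.3, 0.5, 0.5, 0.7, 0.7, 0.9, 0.9]
correct = [0, 0, 1, 0, 1, 1, 1, 1]

curve = precision_curve(confidences, correct, thresholds)
threshold = min_threshold_for_precision(thresholds, curve, target=0.9)

print(curve)
print(threshold)
```

With this toy data the precision at threshold 0.0 is 0.625 and reaches 1.0 at threshold 0.6, so a target precision of 0.9 yields a threshold of 0.6: the model would answer when its confidence is at least 0.6 and abstain otherwise.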