GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration

Yoo Yeon Sung, Eve Fleisig, Yu Hou, Ishan Upadhyay, Jordan Lee Boyd-Graber

TL;DR

GRACE presents a granular, human-grounded benchmark for evaluating model calibration by using incremental, adversarial clues and live human-vs-model competitions. It introduces CalScore, a human-adjusted calibration metric that factors in human performance to penalize confidently wrong or underconfident model behavior relative to humans. The dataset enables per-instance calibration analysis and reveals that state-of-the-art models remain significantly miscalibrated compared with humans, despite sometimes higher raw accuracy. CalScore correlates with traditional metrics but uncovers additional miscalibration patterns, especially for weaker models, highlighting concrete directions to improve calibration and human–AI collaboration. Overall, GRACE provides a principled framework for diagnosing and guiding progress toward better-calibrated language models.

Abstract

Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams' timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.
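The incremental setting described above can be illustrated with a small sketch. It assumes that a model emits a guess and a confidence after each clue and buzzes the first time its confidence reaches a fixed threshold. The names `Step`, `Buzz`, `first_buzz`, the threshold value of 0.8, and the case-insensitive string match are all hypothetical choices made for illustration, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Step:
    """A model's guess and confidence after one more clue is revealed."""
    guess: str
    confidence: float


@dataclass
class Buzz:
    """Where a team interrupted the clues, and whether it was right."""
    clue_index: int      # 1-based index of the clue at which the buzz happened
    guess: str
    confidence: float
    correct: bool


def first_buzz(steps: Sequence[Step], answer: str,
               threshold: float = 0.8) -> Optional[Buzz]:
    """Return the first buzz, or None if confidence never reaches the threshold.

    Assumption: a fixed confidence threshold decides when to buzz, and
    correctness is a case-insensitive exact match against the answer.
    """
    target = answer.strip().lower()
    for index, step in enumerate(steps, start=1):
        if step.confidence >= threshold:
            is_correct = step.guess.strip().lower() == target
            return Buzz(index, step.guess, step.confidence, is_correct)
    return None


# A model that waits until it is confident, and one that buzzes too early.
patient = [Step("four", 0.30), Step("three", 0.85), Step("three", 0.95)]
hasty = [Step("four", 0.90), Step("three", 0.95)]

patient_buzz = first_buzz(patient, "three")
hasty_buzz = first_buzz(hasty, "three")
```

Under these assumptions, the patient model buzzes at the second clue with a correct answer, while the hasty model buzzes at the first clue with a confidently wrong one, which is the kind of behavior a calibration benchmark in this setting is meant to expose.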

Paper Structure

This paper contains 48 sections, 10 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: To create the GRACE dataset, expert question writers develop questions with multiple clues of decreasing difficulty via an interface that shows where weaker models struggle to answer the questions. These questions are used in human vs. model competitions where teams compete to be the first to interrupt the sequence of clues with a correct answer. For each question, we record the point at which each human and model team buzzes in, along with whether the answer is correct (+) or incorrect (-); these records are the buzzpoints. The dataset contains all buzzpoints throughout the competition. Then, CalScore measures each model's human-grounded calibration performance.
  • Figure 2: Example question on Chinese literature (with the answer of three) being written in the interface. Writers compose questions in the left box. On the right, they see the model's guess and confidence after every sentence and the point at which the model would buzz in and attempt to answer. Writers learn which sentences make it harder for models to answer correctly and refine their questions to be sufficiently hard for models but still answerable by humans. This incremental, adversarial format permits granular calibration measurement.
  • Figure 3: While GPT-4o buzzes too early with an incorrect answer, losing 5 points, the human team (H1) buzzes later with a correct answer, earning 10 points. Both teams must balance accuracy and speed; here, GPT-4o shows poorer calibration than H1.
  • Figure 4: Each team's cumulative buzzes (normalized by the number of matches each team participated in). The top quartile of human teams (Q4) achieves the highest cumulative correct buzz rate, peaking over twice as high as the best model. Top human teams are thus more accurate and better-calibrated than models, even as the difficulty changes when more clues are revealed.
  • Figure 5: Comparison of human and model average accuracy rates as more clues are revealed (whether the team's guess is correct after seeing the first $n$ clues). As more clues are revealed, accuracy improves for both models and humans. Models often answer incorrectly until most clues are provided, and human accuracy increases more rapidly, validating that each instance becomes easier for both humans and models and that most humans can answer correctly by the end.
  • ...and 1 more figure
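The point values in the Figure 3 caption and the per-clue accuracy in the Figure 5 caption can be sketched in a few lines. This is only an illustration under stated assumptions: the +10 and -5 values are taken from the Figure 3 example and may not cover every rule of the competition, the function names `team_score` and `accuracy_by_clue` are hypothetical, and none of this is the CalScore metric itself.

```python
from typing import List, Sequence

# Point values as they appear in the Figure 3 example (assumed to apply
# to every buzz for the purpose of this sketch).
CORRECT_POINTS = 10
INCORRECT_POINTS = -5


def team_score(buzz_outcomes: Sequence[bool]) -> int:
    """Sum points over a team's buzzes; True means the buzz was correct."""
    return sum(CORRECT_POINTS if ok else INCORRECT_POINTS
               for ok in buzz_outcomes)


def accuracy_by_clue(correct_after_clue: Sequence[Sequence[bool]]) -> List[float]:
    """Average accuracy after the first n clues, for n = 1, 2, ...

    correct_after_clue[q][n] says whether the guess on question q was
    correct after seeing the first n + 1 clues. Questions are truncated
    to the shortest clue count so every position averages over all
    questions (a simplifying assumption).
    """
    if not correct_after_clue:
        return []
    n_clues = min(len(row) for row in correct_after_clue)
    n_questions = len(correct_after_clue)
    return [sum(row[n] for row in correct_after_clue) / n_questions
            for n in range(n_clues)]


# Figure 3 scenario: the model buzzes early and is wrong, the human team
# buzzes later and is right.
model_points = team_score([False])
human_points = team_score([True])

# Two toy questions with three clues each.
curve = accuracy_by_clue([[False, True, True],
                          [False, False, True]])
```

With the toy data, the model ends at -5 points and the human team at 10, and the accuracy curve rises from 0.0 to 0.5 to 1.0 as clues are revealed, mirroring the qualitative trend described for Figure 5.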