GRACE: A Granular Benchmark for Evaluating Model Calibration against Human Calibration
Yoo Yeon Sung, Eve Fleisig, Yu Hou, Ishan Upadhyay, Jordan Lee Boyd-Graber
TL;DR
GRACE is a granular, human-grounded benchmark for evaluating model calibration, built from questions with incremental, adversarial clues and from live human-vs-model competitions. It introduces CalScore, a human-adjusted calibration metric that factors in human performance to penalize model behavior that is confidently wrong or underconfident relative to humans. The dataset enables per-instance calibration analysis and reveals that state-of-the-art models remain significantly miscalibrated compared with humans, even though their raw accuracy is sometimes higher. CalScore correlates with traditional calibration metrics but uncovers additional miscalibration patterns, especially for weaker models, pointing to concrete directions for improving calibration and human–AI collaboration. Overall, GRACE provides a principled framework for diagnosing miscalibration and guiding progress toward better-calibrated language models.
Abstract
Language models are often miscalibrated, leading to confidently incorrect answers. We introduce GRACE, a benchmark for language model calibration that incorporates comparison with human calibration. GRACE consists of question-answer pairs, in which each question contains a series of clues that gradually become easier, all leading to the same answer; models must answer correctly as early as possible as the clues are revealed. This setting permits granular measurement of model calibration based on how early, accurately, and confidently a model answers. After collecting these questions, we host live human vs. model competitions to gather 1,749 data points on human and model teams' timing, accuracy, and confidence. We propose a metric, CalScore, that uses GRACE to analyze model calibration errors and identify types of model miscalibration that differ from human behavior. We find that although humans are less accurate than models, humans are generally better calibrated. Since state-of-the-art models struggle on GRACE, it effectively evaluates progress on improving model calibration.
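
The incremental-clue setting described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GRACE's evaluation code: `BuzzRecord`, `play_question`, `toy_guesser`, and the fixed confidence threshold are hypothetical names and choices introduced here, and the toy question is invented rather than drawn from the dataset. The sketch records only the three quantities the abstract names (how early, how accurately, and how confidently a player answers); it does not compute CalScore.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Hypothetical guesser signature: revealed clues -> (guess, confidence in [0, 1]).
Guesser = Callable[[Sequence[str]], Tuple[str, float]]


@dataclass
class BuzzRecord:
    """Illustrative record of one attempt; not GRACE's actual schema."""
    clue_index: Optional[int]  # 0-based clue at which the player answered, None if never
    guess: Optional[str]       # last guess produced by the player
    confidence: float          # confidence attached to that guess
    correct: bool              # whether the committed answer matched the gold answer


def play_question(clues: Sequence[str], answer: str, guesser: Guesser,
                  threshold: float = 0.7) -> BuzzRecord:
    """Reveal clues one at a time; commit once confidence reaches the threshold."""
    guess: Optional[str] = None
    confidence = 0.0
    for i in range(len(clues)):
        guess, confidence = guesser(clues[: i + 1])
        if confidence >= threshold:
            return BuzzRecord(i, guess, confidence, guess == answer)
    # The player never committed to an answer, so the attempt counts as incorrect.
    return BuzzRecord(None, guess, confidence, False)


# Invented toy question: clues go from hardest to easiest, all pointing to "Paris".
CLUES = [
    "This city hosted the 1900 Summer Olympics.",
    "It lies on the Seine.",
    "It is home to the Eiffel Tower.",
]


def toy_guesser(revealed: Sequence[str]) -> Tuple[str, float]:
    """Toy player: confidence grows with each clue; only the last clue fixes the guess."""
    confidence = min(1.0, 0.3 * len(revealed))
    guess = "Paris" if any("Eiffel" in clue for clue in revealed) else "Lyon"
    return guess, confidence


if __name__ == "__main__":
    for t in (0.5, 0.7):
        print(t, play_question(CLUES, "Paris", toy_guesser, threshold=t))
```

With the threshold at 0.7 the toy player waits for the third clue and answers correctly. Lowering it to 0.5 makes the player commit one clue earlier with a wrong guess, which is the kind of early, confident error this setting is designed to expose.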
