Do Large Language Models Know What They Don't Know? KalshiBench: A New Benchmark for Evaluating Epistemic Calibration via Prediction Markets

Lukas Nel

TL;DR

KalshiBench introduces a temporally-filtered, prediction-market-based benchmark to assess epistemic calibration in large language models, avoiding memorization biases by using real-world outcomes resolved after model cutoffs. Evaluating five frontier models reveals universal overconfidence and a decoupling between calibration and accuracy, with reasoning-enhanced models performing worse on calibration. Only one model modestly beats a base-rate baseline, and extended reasoning does not reliably improve uncertainty quantification. The work highlights calibration as a distinct capability requiring targeted development, and it provides reproducible baselines and domain-aware insights for deployment in high-stakes settings.

Abstract

A well-calibrated model should express confidence that matches its actual accuracy: when it claims 80% confidence, it should be correct 80% of the time. While large language models (LLMs) have achieved remarkable performance across diverse tasks, their epistemic calibration remains poorly understood. We introduce KalshiBench, a benchmark of 300 prediction market questions from Kalshi, a CFTC-regulated exchange, with verifiable real-world outcomes occurring after model training cutoffs. Unlike traditional benchmarks measuring accuracy on static knowledge, KalshiBench evaluates whether models can appropriately quantify uncertainty about genuinely unknown future events. We evaluate five frontier models (Claude Opus 4.5, GPT-5.2, DeepSeek-V3.2, Qwen3-235B, and Kimi-K2) and find systematic overconfidence across all models. Even the best-calibrated model (Claude Opus 4.5, ECE=0.120) shows substantial calibration errors, while reasoning-enhanced models like GPT-5.2-XHigh exhibit worse calibration (ECE=0.395) despite comparable accuracy. Critically, only one model achieves a positive Brier Skill Score, indicating most models perform worse than simply predicting base rates. Our findings suggest that scaling and enhanced reasoning do not automatically confer calibration benefits, highlighting epistemic calibration as a distinct capability requiring targeted development.
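
To make the two metrics named in the abstract concrete, the following is a minimal sketch of expected calibration error (ECE) and Brier Skill Score for binary yes/no forecasts. It uses the standard textbook definitions, not necessarily the paper's exact procedure: the 10 equal-width confidence bins, the use of the in-sample base rate as the reference forecast, and the function names are all assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """ECE over confidence bins (assumed: 10 equal-width bins on [0, 1]).

    probs[i] is the forecast probability of YES; outcomes[i] is 1 for YES, 0 for NO.
    Confidence is max(p, 1 - p); a forecast is correct if its likelier side occurred.
    """
    if len(probs) != len(outcomes) or not probs:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    conf_sum = [0.0] * n_bins
    correct_sum = [0.0] * n_bins
    counts = [0] * n_bins
    for p, y in zip(probs, outcomes):
        confidence = max(p, 1.0 - p)
        correct = 1.0 if (p >= 0.5) == (y == 1) else 0.0
        b = min(int(confidence * n_bins), n_bins - 1)  # clamp confidence == 1.0
        conf_sum[b] += confidence
        correct_sum[b] += correct
        counts[b] += 1
    n = len(probs)
    # Weighted average of |accuracy - mean confidence| over non-empty bins.
    return sum(
        (counts[b] / n) * abs(correct_sum[b] / counts[b] - conf_sum[b] / counts[b])
        for b in range(n_bins)
        if counts[b] > 0
    )


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def brier_skill_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """1 - BS / BS_ref, where the reference always predicts the base rate.

    Assumption: the base rate is the mean outcome of the same question set.
    Positive values beat the base-rate forecast; negative values are worse.
    """
    base_rate = sum(outcomes) / len(outcomes)
    reference = brier_score([base_rate] * len(outcomes), outcomes)
    if reference == 0.0:
        raise ValueError("base-rate reference is perfect; skill score undefined")
    return 1.0 - brier_score(probs, outcomes) / reference


# Hypothetical toy data: five forecasts of 90% YES, four of which resolve YES.
toy_probs = [0.9, 0.9, 0.9, 0.9, 0.9]
toy_outcomes = [1, 1, 1, 1, 0]
toy_ece = expected_calibration_error(toy_probs, toy_outcomes)
toy_bss = brier_skill_score(toy_probs, toy_outcomes)
```

On this toy data the forecaster claims 90% confidence but is right 80% of the time, so the ECE is about 0.10 and the Brier Skill Score is slightly negative (about -0.06). This is the same pattern the abstract describes: an overconfident forecaster can score worse than one that simply predicts the base rate.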

Paper Structure

This paper contains 59 sections, 6 equations, 1 figure, 11 tables.

Figures (1)

  • Figure 1: Summary of main results. While accuracy varies modestly (64-69%), calibration error varies dramatically (3× range). Reasoning enhancements (GPT-5.2-XHigh) worsen rather than improve calibration.