The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration

Sudipta Ghosh, Mrityunjoy Panday

TL;DR

An empirical study of whether LLMs exhibit patterns reminiscent of the Dunning-Kruger effect, a cognitive bias in which individuals with limited competence tend to overestimate their abilities. The study finds that poorly performing models display markedly higher overconfidence.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their ability to accurately assess their own confidence remains poorly understood. We present an empirical study investigating whether LLMs exhibit patterns reminiscent of the Dunning-Kruger effect -- a cognitive bias where individuals with limited competence tend to overestimate their abilities. We evaluate four state-of-the-art models (Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, and Kimi K2) across four benchmark datasets totaling 24,000 experimental trials. Our results reveal striking calibration differences: Kimi K2 exhibits severe overconfidence with an Expected Calibration Error (ECE) of 0.726 despite only 23.3% accuracy, while Claude Haiku 4.5 achieves the best calibration (ECE = 0.122) with 75.4% accuracy. These findings demonstrate that poorly performing models display markedly higher overconfidence -- a pattern analogous to the Dunning-Kruger effect in human cognition. We discuss implications for safe deployment of LLMs in high-stakes applications.
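
For readers unfamiliar with the metric, the following is a minimal sketch of how an Expected Calibration Error is commonly computed: predictions are grouped into equal-width confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The function name, the choice of 10 bins, and the toy inputs are assumptions made for illustration; the abstract does not state the binning scheme the authors used.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE (illustrative sketch, not the authors' code).

    confidences: stated confidences in [0, 1], one per trial.
    correct:     1 (or True) if the answer was right, else 0, one per trial.
    n_bins:      number of equal-width bins; 10 is an assumed default.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one trial is required")

    conf_sum = [0.0] * n_bins   # sum of confidences per bin
    hit_sum = [0.0] * n_bins    # number of correct answers per bin
    count = [0] * n_bins        # number of trials per bin

    for c, y in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        b = min(int(c * n_bins), n_bins - 1)
        conf_sum[b] += c
        hit_sum[b] += 1.0 if y else 0.0
        count[b] += 1

    ece = 0.0
    for b in range(n_bins):
        if count[b] == 0:
            continue
        mean_conf = conf_sum[b] / count[b]
        accuracy = hit_sum[b] / count[b]
        # Weight each bin's |accuracy - confidence| gap by its share of trials.
        ece += (count[b] / n) * abs(accuracy - mean_conf)
    return ece


# Toy data (hypothetical): always 95% confident, right one time in four.
overconfident = expected_calibration_error([0.95] * 4, [1, 0, 0, 0])
# Toy data (hypothetical): 75% confident, right three times in four.
calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
```

In this toy case the overconfident predictor has a gap of 0.95 - 0.25 = 0.70, while the second predictor's confidence matches its accuracy and its ECE is zero. A model that is both inaccurate and highly confident therefore scores a large ECE, which is the pattern the abstract reports for Kimi K2.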

Paper Structure

This paper contains 28 sections, 1 equation, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: Confidence vs. accuracy across models, demonstrating the Dunning-Kruger effect. The diagonal line represents perfect calibration. Kimi K2 shows severe overconfidence (top-left quadrant) while Claude Haiku 4.5 exhibits near-optimal calibration.
  • Figure 2: ECE comparison across models. Lower values indicate better calibration. Claude Haiku 4.5 achieves the best calibration, while Kimi K2 shows severe miscalibration.
  • Figure 3: Reliability diagrams for Claude Haiku 4.5 (left) and Kimi K2 (right). The diagonal represents perfect calibration. Claude Haiku 4.5 shows well-distributed confidence with close adherence to the diagonal, while Kimi K2 shows severe overconfidence with most responses clustered at high confidence despite low accuracy.
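
As a rough illustration of what a reliability diagram such as Figure 3 plots, the sketch below groups trials into equal-width confidence bins and reports, for each non-empty bin, the trial count, the mean stated confidence, and the observed accuracy. Points where accuracy falls below mean confidence lie under the diagonal and indicate overconfidence. The function name, the 10-bin default, and the toy inputs are assumptions; the authors' plotting procedure is not described in this excerpt.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Per-bin statistics behind a reliability diagram (illustrative sketch).

    Returns one tuple per non-empty bin:
    (bin_lower, bin_upper, count, mean_confidence, accuracy).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")

    # Collect the trials that fall into each equal-width bin.
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        b = min(int(c * n_bins), n_bins - 1)  # 1.0 goes in the last bin
        bins[b].append((c, 1.0 if y else 0.0))

    rows = []
    for b, trials in enumerate(bins):
        if not trials:
            continue
        count = len(trials)
        mean_conf = sum(c for c, _ in trials) / count
        accuracy = sum(y for _, y in trials) / count
        rows.append((b / n_bins, (b + 1) / n_bins, count, mean_conf, accuracy))
    return rows


# Toy data (hypothetical): two trials near 0.55 and two near 0.95.
rows = reliability_bins([0.95, 0.95, 0.55, 0.55], [1, 0, 1, 1])
for lower, upper, count, mean_conf, accuracy in rows:
    print(f"[{lower:.1f}, {upper:.1f}): n={count} "
          f"confidence={mean_conf:.2f} accuracy={accuracy:.2f}")
```

A distribution like the one described for Kimi K2 would put most of the count in the highest-confidence bins with low accuracy there, whereas a well-calibrated model would spread trials across bins with accuracy close to mean confidence in each.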