The Dunning-Kruger Effect in Large Language Models: An Empirical Study of Confidence Calibration

Sudipta Ghosh; Mrityunjoy Panday

TL;DR

An empirical study investigating whether LLMs exhibit patterns reminiscent of the Dunning-Kruger effect, a cognitive bias in which individuals with limited competence tend to overestimate their abilities. The study finds that poorly performing models display markedly higher overconfidence.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities across diverse tasks, yet their ability to accurately assess their own confidence remains poorly understood. We present an empirical study investigating whether LLMs exhibit patterns reminiscent of the Dunning-Kruger effect -- a cognitive bias where individuals with limited competence tend to overestimate their abilities. We evaluate four state-of-the-art models (Claude Haiku 4.5, Gemini 2.5 Pro, Gemini 2.5 Flash, and Kimi K2) across four benchmark datasets totaling 24,000 experimental trials. Our results reveal striking calibration differences: Kimi K2 exhibits severe overconfidence with an Expected Calibration Error (ECE) of 0.726 despite only 23.3% accuracy, while Claude Haiku 4.5 achieves the best calibration (ECE = 0.122) with 75.4% accuracy. These findings demonstrate that poorly performing models display markedly higher overconfidence -- a pattern analogous to the Dunning-Kruger effect in human cognition. We discuss implications for safe deployment of LLMs in high-stakes applications.
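
For readers unfamiliar with the metric, the following is a minimal sketch of how an Expected Calibration Error of the kind reported above is commonly computed: predictions are grouped into confidence bins, and the gap between mean confidence and accuracy in each bin is averaged, weighted by bin size. The function name, the choice of 10 equal-width bins, and the toy inputs are assumptions made for illustration; the abstract does not state the binning scheme the authors used.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,  # assumption: 10 equal-width bins over [0, 1]
) -> float:
    """Weighted mean of |accuracy - mean confidence| over confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    conf_sum = [0.0] * n_bins  # sum of stated confidences per bin
    hits = [0] * n_bins        # number of correct answers per bin
    counts = [0] * n_bins      # number of predictions per bin

    for conf, is_correct in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidence must lie in [0, 1]")
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        conf_sum[idx] += conf
        hits[idx] += int(is_correct)
        counts[idx] += 1

    ece = 0.0
    for b in range(n_bins):
        if counts[b] == 0:
            continue  # empty bins contribute nothing
        accuracy = hits[b] / counts[b]
        mean_conf = conf_sum[b] / counts[b]
        ece += (counts[b] / n) * abs(accuracy - mean_conf)
    return ece


# Toy data (not from the paper): a model that always states 90% confidence
# but answers only one question in four correctly.
overconfident = expected_calibration_error(
    [0.9, 0.9, 0.9, 0.9], [True, False, False, False]
)
```

On this toy input the stated confidence (0.9) exceeds the accuracy (0.25) by about 0.65, which is the value the sketch returns. A model whose stated confidence matches its accuracy in every bin would score close to 0.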

Paper Structure (28 sections, 1 equation, 3 figures, 4 tables)

Introduction
Related Work
Confidence Calibration in Machine Learning
Confidence-Competence Gap in LLMs
Uncertainty Quantification in LLMs
Overconfidence and Calibration Methods
LLM Benchmarking
Methodology
Experimental Design
Models Under Evaluation
Benchmark Datasets
Sample Size
Confidence Elicitation Protocol
Evaluation Metrics
Results
...and 13 more sections

Figures (3)

Figure 1: Confidence vs. accuracy across models, demonstrating the Dunning-Kruger effect. The diagonal line represents perfect calibration. Kimi K2 shows severe overconfidence (top-left quadrant) while Claude Haiku 4.5 exhibits near-optimal calibration.
Figure 2: ECE comparison across models. Lower values indicate better calibration. Claude Haiku 4.5 achieves the best calibration, while Kimi K2 shows severe miscalibration.
Figure 3: Reliability diagrams for Claude Haiku 4.5 (left) and Kimi K2 (right). The diagonal represents perfect calibration. Claude Haiku 4.5 shows well-distributed confidence with close adherence to the diagonal, while Kimi K2 shows severe overconfidence with most responses clustered at high confidence despite low accuracy.
