Do Code Models Suffer from the Dunning-Kruger Effect?

Mukul Singh, Somya Chatterjee, Arjun Radhakrishna, Sumit Gulwani

TL;DR

This work investigates whether code-focused AI systems exhibit the Dunning-Kruger Effect (DKE), comparing perceived confidence to actual performance across 37 programming languages using multiple-choice question answering (MCQA) tasks derived from CodeNet. Using both absolute confidence and relative confidence measures (Elo and TrueSkill), the study shows that lower-competence systems tend to overestimate their abilities, while higher-competence systems are better calibrated or even underconfident. Miscalibration is stronger in rarer languages and in domain-specialized models. The findings highlight the importance of robust self-assessment mechanisms for trustworthy human-AI collaboration and suggest that domain rarity and specialization modulate calibration, raising questions about the underlying causes and how to mitigate them. The work lays groundwork for future research on whether DKE in AI arises from cognitive-like processes or statistical artifacts, and on practical approaches to improving reliability in co-creative coding tasks.

Abstract

As artificial intelligence systems increasingly collaborate with humans in creative and technical domains, questions arise about the cognitive boundaries and biases that shape our shared agency. This paper investigates the Dunning-Kruger Effect (DKE), the tendency for those with limited competence to overestimate their abilities, in state-of-the-art LLMs on coding tasks. By analyzing model confidence and performance across a diverse set of programming languages, we reveal that AI models mirror human patterns of overconfidence, especially in unfamiliar or low-resource domains. Our experiments demonstrate that less competent models and those operating in rare programming languages exhibit stronger DKE-like bias, suggesting that the strength of the bias is inversely related to the competence of the models.
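
To make the comparison of perceived confidence and actual performance concrete, the following is a minimal sketch, not the authors' implementation. It assumes each answered question yields a self-reported confidence in [0, 1] and a correctness flag, and it computes an overconfidence gap (mean confidence minus accuracy). It also shows the standard Elo update, which could be applied if relative confidence is elicited through pairwise comparisons; that elicitation scheme, the `Result` type, and the sample data are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Result:
    # Hypothetical record for one multiple-choice question.
    confidence: float  # self-reported probability of being correct, in [0, 1]
    correct: bool      # whether the chosen option was actually right


def overconfidence_gap(results):
    """Mean stated confidence minus accuracy; positive means overconfident."""
    mean_conf = sum(r.confidence for r in results) / len(results)
    accuracy = sum(r.correct for r in results) / len(results)
    return mean_conf - accuracy


def elo_update(rating_a, rating_b, score_a, k=32.0):
    """Standard Elo update; score_a is 1.0 if A wins, 0.0 if A loses, 0.5 for a tie."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta


# Illustrative, made-up data: a weak model that is very sure of itself
# and a stronger model whose confidence tracks its accuracy.
weak = [Result(0.9, True), Result(0.8, False), Result(0.9, False), Result(0.8, False)]
strong = [Result(0.7, True), Result(0.8, True), Result(0.7, True), Result(0.8, False)]

print("weak model gap:", round(overconfidence_gap(weak), 3))
print("strong model gap:", round(overconfidence_gap(strong), 3))
```

Under this sketch, a DKE-like pattern would show up as a larger positive gap for models (or languages) with lower accuracy. The Elo helper is only one way to turn pairwise "which am I more confident about" judgments into a ranking; TrueSkill would need a dedicated library and is not shown.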

Paper Structure

This paper contains 35 sections, 6 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Actual vs. perceived performance for GPT-4o across different languages, sorted by actual performance.
  • Figure 2: Inter-model DKE.
  • Figure 3: Dunning-Kruger plots for various models.
  • Figure 4: Dunning-Kruger plots for various programming languages.