Measuring the metacognition of AI

Richard Servajean, Philippe Servajean

Abstract

A robust decision-making process must take uncertainty into account, especially when the choice involves inherent risks. Because artificial intelligence (AI) systems are increasingly integrated into decision-making workflows, managing uncertainty relies more and more on the metacognitive capabilities of these systems, i.e., their ability to assess the reliability of, and to regulate, their own decisions. Hence, it is crucial to employ robust methods to measure the metacognitive abilities of AI. This paper is primarily a methodological contribution arguing for the adoption of the meta-d' framework, or its model-free alternatives, as the gold standard for assessing the metacognitive sensitivity of AIs--the ability to generate confidence ratings that distinguish correct from incorrect responses. Moreover, we propose to leverage signal detection theory (SDT) to measure the ability of AIs to spontaneously regulate their decisions based on uncertainty and risk. To demonstrate the practical utility of these psychophysical frameworks, we conduct two series of experiments on three large language models (LLMs)--GPT-5, DeepSeek-V3.2-Exp, and Mistral-Medium-2508. In the first series, the LLMs performed a primary judgment followed by a confidence rating. In the second, the LLMs only performed the primary judgment, while we manipulated the risk associated with either response. On the one hand, applying the meta-d' framework allows us to conduct comparisons along three axes: comparing an LLM to optimality, comparing different LLMs on a given task, and comparing the same LLM across different tasks. On the other hand, SDT allows us to assess whether LLMs become more conservative when risks are high.
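As a point of reference for the SDT quantities the paper relies on (type 1 sensitivity $d'$ and criterion $c$), the following is a minimal sketch of how they are computed from response counts under the standard equal-variance Gaussian model. The function name and the log-linear correction for extreme rates are illustrative choices, not taken from the paper:

```python
from statistics import NormalDist

def sdt_measures(hits, misses, false_alarms, correct_rejections):
    """Type 1 SDT sensitivity d' and criterion c from response counts.

    Adds 0.5 to each cell (log-linear correction) to avoid infinite
    z-scores when a rate equals 0 or 1; other corrections exist.
    """
    z = NormalDist().inv_cdf  # inverse standard normal CDF
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    d_prime = z(hit_rate) - z(fa_rate)        # sensitivity
    c = -0.5 * (z(hit_rate) + z(fa_rate))     # criterion (response bias)
    return d_prime, c
```

Under this convention, $c > 0$ indicates a conservative bias toward one response and $c < 0$ a liberal bias toward the other; $c = 0$ is the unbiased criterion.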


Paper Structure

This paper contains 10 sections, 4 equations, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: $d'$, meta-$d'$ and $M_{ratio}$ (left column) and the accuracy, or $P(\text{correct at type 1 task} \mid \text{confidence} = C_i )$ versus confidence level from C1 to C5 (right column). The results are shown for task A (top row), task B (middle row) and task C (bottom row). Error bars in the left column denote 95% CI (see the Methods for details). We tested GPT-5 (blue; left bars), DeepSeek-V3.2-Exp (orange; middle bars) and Mistral-Medium-2508 (yellow; right bars). See the legends on top of panel (a) and in panel (b). When computing accuracy versus confidence (right column), GPT-5 reported confidence C1 in one trial of task B and one trial of task C and similarly, in task B, Mistral-Medium-2508 reported confidence C3 in approximately 0.1% of trials; these observations were excluded, leaving no data points at those confidence levels. The number of trials was $2 \times 10^3$ for task A and $10^3$ otherwise (See the Methods for further details).
  • Figure 2: Type 1 criterion $c$ across three risk configurations ("S1", "None", "S2") within task A (top row), task B (middle row) and task C (bottom row), for GPT-5 (left column), DeepSeek-V3.2-Exp (middle column), and Mistral-Medium-2508 (right column). Error bars represent 95% confidence intervals estimated via the Delta method. Although the Delta method can be sensitive to extreme proportions (that is, hit rates or false alarm rates close to 0 or 1), as may occur when responses are strongly biased toward S1 or S2, the large trial count ($N \geq 10^4$) mitigates potential issues. See the Methods for details.
  • Figure S1: Conditional proportion of each confidence rating given that the response at the type 1 task was correct (green; right bars) or incorrect (red; left bars), i.e., $P(\text{confidence} = C_i \mid \text{correct or incorrect at type 1 task})$, for task A (top row), task B (middle row) and task C (bottom row). See Figure 1 for further details.
  • Figure S2: $c' = c / d'$ across three risk configurations ("S1", "None", "S2") within task A (top row), task B (middle row) and task C (bottom row), for GPT-5 (left column), DeepSeek-V3.2-Exp (middle column), and Mistral-Medium-2508 (right column). Error bars represent 95% confidence intervals estimated via the Delta method. See the Methods for details.
  • Figure S3: Average $d'$ (a) and meta-$d'$ (b) versus number of trials. Each data point is the estimated $d'$ or meta-$d'$, averaged over 20 repetitions, where each estimate is computed with the number of trials given on the x-axis. Error bars correspond to standard deviations. In panel (a), the error bars vanish because the simulation procedure deterministically set the type 1 response counts. The data were generated using the function exampleFit.m from Ref. fleming2017hmeta, available at https://github.com/smfleming/HMeta-d, with the true $d'$ and meta-$d'$, indicated by a red horizontal dashed line in each panel, equal to 3.2 and 3 (i.e., very high type 1 sensitivity and high metacognitive efficiency), respectively. Type 1 criterion $c = 0$ and type 2 criteria are as follows: c1 = [-2 -1.5 -1 -0.5], c2 = [0.5 1 1.5 2].
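The Delta-method confidence intervals mentioned in the captions of Figures 2 and S2 can be sketched as follows, assuming the standard variance approximation $\mathrm{Var}(z(p)) \approx p(1-p) / (n\,\varphi(z(p))^2)$ for a z-transformed proportion; the function name is illustrative and the sketch omits the corrections for extreme rates that a production analysis would need:

```python
from statistics import NormalDist

def criterion_ci(hits, misses, false_alarms, correct_rejections, level=0.95):
    """Approximate CI for the type 1 criterion c via the Delta method.

    Uses Var(z(p)) ~ p(1 - p) / (n * pdf(z(p))^2) and
    Var(c) = (Var(z(H)) + Var(z(F))) / 4. The approximation degrades
    when a rate approaches 0 or 1, as noted in the Figure 2 caption.
    """
    nd = NormalDist()
    n_signal = hits + misses
    n_noise = false_alarms + correct_rejections
    h = hits / n_signal
    f = false_alarms / n_noise
    zh, zf = nd.inv_cdf(h), nd.inv_cdf(f)
    c = -0.5 * (zh + zf)
    var_c = 0.25 * (h * (1 - h) / (n_signal * nd.pdf(zh) ** 2)
                    + f * (1 - f) / (n_noise * nd.pdf(zf) ** 2))
    half_width = nd.inv_cdf(0.5 + level / 2) * var_c ** 0.5
    return c - half_width, c + half_width
```

With $N \geq 10^4$ trials per configuration, as in the paper's second series of experiments, the resulting intervals are narrow as long as the hit and false alarm rates stay away from the extremes.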