Does Explanation Correctness Matter? Linking Computational XAI Evaluation to Human Understanding

Gregor Baer, Chao Zhang, Isel Grau, Pieter Van Gorp

Abstract

Explainable AI (XAI) methods are commonly evaluated with functional metrics such as correctness, which computationally estimate how accurately an explanation reflects the model's reasoning. Higher correctness is assumed to produce better human understanding, but this link has not been tested experimentally with controlled levels. We conducted a user study (N=200) that manipulated explanation correctness at four levels (100%, 85%, 70%, 55%) in a time series classification task where participants could not rely on domain knowledge or visual intuition and instead predicted the AI's decisions based on explanations (forward simulation). Correctness affected understanding, but not at every level: performance dropped at 70% and 55% correctness relative to fully correct explanations, while further degradation below 70% produced no additional loss. Rather than shifting performance uniformly, lower correctness decreased the proportion of participants who learned the decision pattern. At the same time, even fully correct explanations did not guarantee understanding, as only a subset of participants achieved high accuracy. Exploratory analyses showed that self-reported ratings correlated with demonstrated performance only when explanations were fully correct and participants had learned the pattern. These findings show that not all differences in functional correctness translate to differences in human understanding, underscoring the need to validate functional metrics against human outcomes.

Paper Structure

This paper contains 21 sections, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: Overview of the experimental design, showing condition assignment, training and test phases, and the post-task survey.
  • Figure 2: Trial screen as shown to participants during both the training and test phases. Participants see a univariate time series curve overlaid with a heatmap indicating feature importance for the AI's decision. In the training phase, the next screen shows feedback with the AI's actual classification; in the test phase, no feedback is provided. The example shown here is from the 70% correctness condition, where the heatmap partially overlaps with the true discriminative feature and partially covers an adjacent non-informative region.
  • Figure 3: Synthetic data generation. Each time series combined a random walk (left) with a class-discriminative feature (center): a peak (Class A) or valley (Class B) spanning 15% of the series length. The aggregated signal (right) is the sum of both components, with the red region marking the ground truth feature location. The random walk produced peaks and valleys of similar magnitude, making the feature difficult to identify without an explanation. (A code sketch of this generation procedure appears after the figure list.)
  • Figure 4: Explanation generation and correctness manipulation. The true feature location was rendered as a heatmap, corrupted with noise, and smoothed to resemble a feature attribution explanation. Less correct explanations were created by spatially displacing the highlighted region. The example shown corresponds to a displacement of 45%, leaving 55% overlap with the true feature location (i.e., the 55% correctness condition). (A second sketch after the figure list illustrates this manipulation.)
  • Figure 5: Forward simulation accuracy by correctness condition. Each dot represents one participant. Horizontal lines within each violin mark the 25th, 50th, and 75th percentiles; the dashed line marks chance performance (0.5). The distribution in the 100% condition is slightly bimodal, with participants clustering at either high accuracy or near chance. The 85% condition shows a similarly wide spread but without clear bimodality. In the 70% and 55% conditions, variance shrinks and accuracy concentrates near chance.
  • ...and 4 more figures
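
As a concrete illustration of the generation procedure described in Figure 3, the minimal Python sketch below combines a random walk with a peak or valley spanning 15% of the series length. The series length, step size, amplitude, and window shape are illustrative assumptions, not the paper's exact parameters.

```python
# Hypothetical sketch of the synthetic data generation from Figure 3.
# Concrete parameter values are assumptions chosen for illustration.
import numpy as np

def generate_series(length=100, feature_frac=0.15, amplitude=1.0, rng=None):
    """Return (series, label, feature_start) for one synthetic example."""
    rng = np.random.default_rng() if rng is None else rng

    # Random walk background: cumulative sum of small Gaussian steps.
    walk = np.cumsum(rng.normal(0.0, 0.3, size=length))

    # Class-discriminative feature: a peak (Class A) or valley (Class B)
    # spanning ~15% of the series, placed at a random position.
    width = int(length * feature_frac)
    start = int(rng.integers(0, length - width))
    label = int(rng.integers(0, 2))              # 0 = Class A (peak), 1 = Class B (valley)
    sign = 1.0 if label == 0 else -1.0
    bump = sign * amplitude * np.hanning(width)  # smooth peak/valley shape

    series = walk.copy()
    series[start:start + width] += bump          # aggregated signal = walk + feature
    return series, label, start
```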
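
The correctness manipulation described in Figure 4 can be sketched in the same spirit: highlight the true feature location, shift the highlighted window so only the desired fraction still overlaps the true feature, then corrupt and smooth it so it resembles a feature-attribution heatmap. The noise level and moving-average smoothing below are assumptions, not the paper's reported settings.

```python
# Hypothetical sketch of the explanation construction from Figure 4.
# Noise level and smoothing kernel are illustrative assumptions.
import numpy as np

def make_explanation(length, feature_start, feature_width,
                     correctness=0.55, noise_sd=0.1, rng=None):
    """Return a saliency map whose highlight overlaps the true feature by `correctness`."""
    rng = np.random.default_rng() if rng is None else rng

    # Displace the highlighted region so that a `correctness` fraction of its
    # width still overlaps the ground-truth feature (0.55 -> 45% displacement).
    shift = int(round((1.0 - correctness) * feature_width))
    start = int(np.clip(feature_start + shift, 0, length - feature_width))

    saliency = np.zeros(length)
    saliency[start:start + feature_width] = 1.0   # displaced highlight

    # Corrupt with noise and smooth with a short moving average so the result
    # looks like a feature-attribution heatmap rather than a hard mask.
    saliency += rng.normal(0.0, noise_sd, size=length)
    saliency = np.convolve(saliency, np.ones(5) / 5.0, mode="same")
    return np.clip(saliency, 0.0, 1.0)
```

Under these assumptions, generate_series and make_explanation together would produce one trial's stimulus: the plotted time series plus its (partially displaced) heatmap overlay, as shown in Figure 2.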