Do Metrics for Counterfactual Explanations Align with User Perception?

Felix Liedeker; Basil Ell; Philipp Cimiano; Christoph Düsing

Abstract

Explainability is widely regarded as essential for trustworthy artificial intelligence systems. However, the algorithmic metrics commonly used to evaluate counterfactual explanations are rarely validated against human judgments of explanation quality. This raises the question of whether such metrics meaningfully reflect user perceptions. We address this question through an empirical study that directly compares algorithmic evaluation metrics with human judgments across three datasets. Participants rated counterfactual explanations along multiple dimensions of perceived quality, and we relate these ratings to a comprehensive set of standard counterfactual metrics. We analyze both individual metric-rating relationships and the extent to which combinations of metrics can predict human assessments. Our results show that correlations between algorithmic metrics and human ratings are generally weak and strongly dataset-dependent. Moreover, increasing the number of metrics used in predictive models does not lead to reliable improvements, indicating structural limitations in how current metrics capture criteria relevant to humans. Overall, our findings suggest that widely used counterfactual evaluation metrics fail to reflect key aspects of explanation quality as perceived by users, underscoring the need for more human-centered approaches to evaluating explainable artificial intelligence.

Paper Structure (16 sections, 8 equations, 3 figures, 2 tables)

This paper contains 16 sections, 8 equations, 3 figures, 2 tables.

Introduction
Related Work
Data Acquisition and Study Design
Datasets and Counterfactual Generation
Sampling of Counterfactual Explanations
User Study Procedure
Power Analysis
Rating Aggregation and Reliability
Automated Metrics
Results
Metric–Rating Correlations
Predictive Modeling
Discussion
Conclusion
...and 1 more sections

Figures (3)

Figure 1: Metric–rating correlations including the CQS for the (a) MUS, (b) OBE, and (c) HRT datasets. Significant correlations ($p<0.05$) are marked with an asterisk.
Figure 2: Histograms of $R^2$ for (a) linear regression and (b) the best-performing non-linear model, RF, on the HRT dataset, predicting the user satisfaction rating.
Figure 3: $R^2$ by number of included metrics for (a) linear regression and (b) RF, the best-performing non-linear model, on the HRT dataset, predicting the user satisfaction rating. Lines show the best and mean $R^2$ across all metric combinations at each complexity level, with shaded bands indicating $\pm 1$ SD.
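
The kind of summary described in the Figure 3 caption (best and mean $R^2$ across all metric combinations at each complexity level) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the data are synthetic, the function names `r_squared` and `r2_by_subset_size` are hypothetical, and $R^2$ is computed in-sample with ordinary least squares. In-sample $R^2$ of the best subset cannot decrease as metrics are added, so a held-out or cross-validated score would be needed to detect a lack of improvement.

```python
from itertools import combinations

import numpy as np


def r_squared(X, y):
    """In-sample R^2 of an ordinary least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(y)), X])      # prepend intercept column
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares solution
    resid = y - A @ coef
    ss_res = float(resid @ resid)
    ss_tot = float(((y - y.mean()) ** 2).sum())
    return 1.0 - ss_res / ss_tot


def r2_by_subset_size(X, y):
    """Map subset size k to (best, mean, SD) of R^2 over all k-metric subsets."""
    n_metrics = X.shape[1]
    summary = {}
    for k in range(1, n_metrics + 1):
        scores = [
            r_squared(X[:, list(cols)], y)
            for cols in combinations(range(n_metrics), k)
        ]
        summary[k] = (max(scores), float(np.mean(scores)), float(np.std(scores)))
    return summary


# Synthetic stand-in data (assumption): 200 explanations, 4 metric columns,
# and a rating that depends only weakly on the first metric.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 0.3 * X[:, 0] + rng.normal(size=200)

summary = r2_by_subset_size(X, y)
for k, (best, mean, sd) in summary.items():
    print(f"k={k}: best={best:.3f} mean={mean:.3f} sd={sd:.3f}")
```

Exhaustive enumeration grows as $2^m$ in the number of metrics $m$, so this approach is only practical for a modest metric set.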
