Towards Unifying Evaluation of Counterfactual Explanations: Leveraging Large Language Models for Human-Centric Assessments

Marharyta Domnich, Julius Välja, Rasmus Moorits Veski, Giacomo Magnifico, Kadi Tulver, Eduard Barbu, Raul Vicente

TL;DR

This paper tackles the lack of human-grounded, standardized evaluation for counterfactual explanations by introducing CounterEval, a dataset of 30 counterfactual scenarios rated on eight human-centric metrics by 206 participants. It demonstrates that large language models can approximate human judgments in both zero-shot and fine-tuned settings, reaching up to 63% and 85% accuracy, respectively, across metrics, and shows potential for adapting to individual evaluators. The work provides a scalable, comparable benchmark for evaluating counterfactual explanations and outlines avenues for larger datasets and human-in-the-loop refinement. These findings matter in practice for developing human-aligned explainability frameworks in AI systems and for enabling consistent cross-method benchmarking.

Abstract

As machine learning models evolve, maintaining transparency demands more human-centric explainable AI techniques. Counterfactual explanations, with roots in human reasoning, identify the minimal input changes needed to obtain a given output and, hence, are crucial for supporting decision-making. Despite their importance, the evaluation of these explanations often lacks grounding in user studies and remains fragmented, with existing metrics not fully capturing human perspectives. To address this challenge, we developed a diverse set of 30 counterfactual scenarios and collected ratings across 8 evaluation metrics from 206 respondents. Subsequently, we fine-tuned different Large Language Models (LLMs) to predict average or individual human judgment across these metrics. Our methodology allowed LLMs to achieve an accuracy of up to 63% in zero-shot evaluations and 85% (on a 3-class prediction task) with fine-tuning across all metrics. The fine-tuned models predicting human ratings offer better comparability and scalability in evaluating different counterfactual explanation frameworks.

Paper Structure

This paper contains 21 sections, 6 figures, 13 tables.

Figures (6)

  • Figure 1: We created a diverse set of counterfactual scenarios where we varied feasibility, consistency, completeness, trust, fairness, complexity and understandability, resulting in 30 counterfactual questions which were evaluated by 206 human respondents on the 8 metrics. We subsequently split the data to fine-tune several LLMs to assess every metric score and compared the results to human data on a reserved set.
  • Figure 2: Spearman correlation table between metrics. The values for Complexity were mapped linearly from the original [-2,2] scale to [1,6] to be in line with the other metrics.
  • Figure 3: Confusion matrices for Llama 3 70B Instruct for metric split: zero-shot model (a), fine-tuned (b); and question split: zero-shot (c) and fine-tuned (d).
  • Figure A.1: Questionnaire questions’ 7 average metric values (no Overall Satisfaction) reduced to 2 dimensions (t-SNE perplexity 3). Colored by average Overall Satisfaction.
  • Figure C.1: Confusion matrices for GPT4 for metric split (a) and question split (b).
  • ...and 1 more figure
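
The Figure 2 caption mentions that Complexity ratings were mapped linearly from [-2, 2] to [1, 6] so they sit on the same scale as the other metrics. A minimal sketch of such a mapping is given below; the function name `rescale` and the sample ratings are illustrative assumptions, not taken from the paper's code.

```python
def rescale(value, src=(-2.0, 2.0), dst=(1.0, 6.0)):
    """Linearly map `value` from the interval `src` onto the interval `dst`.

    Hypothetical helper: the defaults mirror the [-2, 2] -> [1, 6]
    mapping described in the Figure 2 caption.
    """
    src_lo, src_hi = src
    dst_lo, dst_hi = dst
    if not src_lo <= value <= src_hi:
        raise ValueError(f"value {value} is outside [{src_lo}, {src_hi}]")
    # Shift to zero, scale by the ratio of interval widths, shift to dst.
    return dst_lo + (value - src_lo) * (dst_hi - dst_lo) / (src_hi - src_lo)


# Illustrative ratings only (not data from the study).
complexity_ratings = [-2, -1, 0, 1, 2]
rescaled = [rescale(r) for r in complexity_ratings]
print(rescaled)  # [1.0, 2.25, 3.5, 4.75, 6.0]
```

Because the map has a positive slope, it preserves the rank order of the ratings; the rescaling puts Complexity on the same numeric range as the other metrics without reordering any responses.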