Designing User-Centric Metrics for Evaluation of Counterfactual Explanations

Firdaus Ahmed Choudhury, Ethan Leicht, Jude Ethan Bislig, Hangzhi Guo, Amulya Yadav

Abstract

Counterfactual Explanations (CFEs) have grown in popularity as a means of offering actionable guidance by identifying the minimum changes in feature values required to flip an ML model's prediction to a more desirable one. Unfortunately, most prior research on CFEs relies on artificial evaluation metrics, such as proximity, which may overlook end-user preferences and constraints; for example, the user's perception of the effort needed to make certain feature changes may differ from that of the model designer. To address this research gap, this paper makes three novel contributions. First, we conduct a pilot study with 20 crowd-workers on Amazon MTurk to test experimentally how well existing CFE evaluation metrics align with real-world user preferences. Results show that user-preferred CFEs matched those selected by proximity in only 63.81% of cases, highlighting the limited applicability of these metrics in real-world settings. Second, motivated by the need for a user-informed evaluation metric for CFEs, we conduct a more detailed two-day user study with 41 participants facing realistic credit application scenarios to find experimental support for or against three intuitive hypotheses that may explain how end users evaluate CFEs. Third, based on the findings of this second study, we propose the AWP model, a novel user-centric, two-stage model that describes one possible mechanism by which users evaluate and select CFEs. Our results show that AWP predicts user-preferred CFEs with 84.37% accuracy. Our study provides the first human-centered validation for personalized cost models in CFE generation and highlights the need for adaptive, user-centered evaluation metrics.

Paper Structure

This paper contains 14 sections, 3 equations, 8 figures.

Figures (8)

  • Figure 1: CFE generation with a Decision Tree Classifier. Let $x$ denote a rejected loan application that reaches the red leaf $L$ in the decision tree. In order to generate counterfactual recourses $x'_1$ and $x'_2$ (that reside in leaves $L'_1$ and $L'_2$, respectively), the shortest path between leaves $L$ and $L'_1$ (or $L'_2$) needs to be traversed. In this figure, these shortest paths are denoted using dotted green and blue paths, respectively.
  • Figure 2: Workflow of our Two-Phase User Study
  • Figure 3: Participant selection rates for recourses generated using weighted proximity with non-personalized weights (left) vs personalized weights (right). The majority of participants aligned only moderately (40–60%) with global-weighted recourses, whereas they aligned very strongly ($>$80%) with personalized-weighted recourses.
  • Figure 4: Flowchart describing our iterative probing strategy to validate Hypothesis 2. Green cells correspond to decision points at which one or more feature-specific acceptability thresholds for the user can be inferred.
  • Figure 5: Histogram of acceptability thresholds for income and credit score features inferred through our iterative probing scenario across all 41 participants.
  • ...and 3 more figures
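
The contrast in Figure 3 between non-personalized and personalized weights can be illustrated with a minimal sketch. It assumes weighted proximity is a weighted L1 distance over normalized features; the exact definition used in the paper is not given in this section, and the function names, feature values, and weights below are hypothetical.

```python
from typing import Dict, List

Features = Dict[str, float]


def weighted_proximity(x: Features, x_cf: Features, weights: Dict[str, float]) -> float:
    """Weighted L1 distance between an instance and a candidate CFE.

    Assumed form, for illustration only; features without an explicit
    weight default to 1.0.
    """
    return sum(weights.get(f, 1.0) * abs(x_cf[f] - x[f]) for f in x)


def pick_cfe(x: Features, candidates: List[Features], weights: Dict[str, float]) -> Features:
    """Return the candidate CFE with the lowest weighted proximity to x."""
    return min(candidates, key=lambda cf: weighted_proximity(x, cf, weights))


# Hypothetical rejected application, features normalized to [0, 1].
x = {"income": 0.40, "credit_score": 0.50}

# Two hypothetical candidate recourses.
candidates = [
    {"income": 0.60, "credit_score": 0.50},  # raise income only
    {"income": 0.40, "credit_score": 0.75},  # raise credit score only
]

# Non-personalized weights treat both features as equally costly.
global_weights = {"income": 1.0, "credit_score": 1.0}
# A hypothetical user who finds raising income three times as hard.
personal_weights = {"income": 3.0, "credit_score": 1.0}

global_choice = pick_cfe(x, candidates, global_weights)
personal_choice = pick_cfe(x, candidates, personal_weights)
```

With equal weights the income-only recourse is closer, but once this hypothetical user's higher cost for changing income is taken into account, the credit-score recourse becomes the preferred one. This is the kind of divergence between global and personalized rankings that the figure reports.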