Controlling Counterfactual Harm in Decision Support Systems Based on Prediction Sets

Eleni Straitouri, Suhas Thejaswi, Manuel Gomez Rodriguez

TL;DR

This work addresses the risk of counterfactual harm when decision support systems provide prediction sets rather than single labels. It models human predictions with a structural causal model and defines counterfactual harm $h_{\lambda}(x,y,\hat{y})$, aiming to ensure $H(\lambda) = \mathbb{E}[h_{\lambda}(X,Y,\hat{Y})] \le \alpha$ for a user-specified bound $\alpha$. The authors show that the average counterfactual harm is identifiable under counterfactual monotonicity and partially identifiable (that is, it can be bounded) under interventional monotonicity, then design a conformal risk control framework to construct prediction-set systems with harm guarantees. Using two human-subject datasets, they demonstrate a consistent trade-off: increasing protection from harm (larger $\lambda$) reduces harm but can lower human accuracy, with guarantees improving as the calibration data grows. This provides a practical pathway to deploy set-valued decision aids with quantifiable safety guarantees in high-stakes domains.

Abstract

Decision support systems based on prediction sets help humans solve multiclass classification tasks by narrowing down the set of potential label values to a subset of them, namely a prediction set, and asking them to always predict label values from the prediction sets. While this type of system has been proven effective at improving the average accuracy of the predictions made by humans, by restricting human agency, it may cause harm: a human who has succeeded at predicting the ground-truth label of an instance on their own may have failed had they used such a system. In this paper, our goal is to control how frequently a decision support system based on prediction sets may cause harm, by design. To this end, we start by characterizing the above notion of harm using the theoretical framework of structural causal models. Then, we show that, under a natural, albeit unverifiable, monotonicity assumption, we can estimate how frequently a system may cause harm using only predictions made by humans on their own. Further, we show that, under a weaker monotonicity assumption, which can be verified experimentally, we can bound how frequently a system may cause harm, again using only predictions made by humans on their own. Building upon these assumptions, we introduce a computational framework to design decision support systems based on prediction sets that are guaranteed to cause harm less frequently than a user-specified value using conformal risk control. We validate our framework using real human predictions from two different human subject studies and show that, in decision support systems based on prediction sets, there is a trade-off between accuracy and counterfactual harm.
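
To make the last step of the abstract more concrete, the following is a minimal sketch of how a generic conformal risk control rule can select parameter values whose expected loss stays below a user-specified level $\alpha$. It is not the authors' implementation. The function name `crc_feasible_lambdas`, the loss matrix `losses`, and the bound `loss_max` are hypothetical, and the sketch assumes that a per-sample harm value in $[0, 1]$ has already been computed on a calibration set for every candidate $\lambda$. The standard conformal risk control guarantee additionally assumes exchangeable data and a loss that is non-increasing in $\lambda$.

```python
import numpy as np


def crc_feasible_lambdas(losses, lambdas, alpha, loss_max=1.0):
    """Return the candidate lambdas that pass a generic conformal risk control check.

    losses   : array of shape (n, m); losses[i, j] is the loss (here, an assumed
               per-sample harm value) of calibration sample i under lambdas[j].
    lambdas  : array of shape (m,) with the candidate parameter values.
    alpha    : user-specified upper bound on the expected loss.
    loss_max : assumed upper bound on any single loss value.
    """
    losses = np.asarray(losses, dtype=float)
    lambdas = np.asarray(lambdas, dtype=float)
    n = losses.shape[0]

    # Empirical risk of each candidate lambda on the calibration set.
    empirical_risk = losses.mean(axis=0)

    # Finite-sample correction: treat the unseen test point as if it
    # incurred the worst possible loss.
    adjusted_risk = (n / (n + 1)) * empirical_risk + loss_max / (n + 1)

    return lambdas[adjusted_risk <= alpha]


if __name__ == "__main__":
    # Synthetic, purely illustrative calibration losses (not real data).
    rng = np.random.default_rng(0)
    grid = np.linspace(0.0, 1.0, 11)
    scores = rng.uniform(size=200)
    toy_losses = (scores[:, None] > grid[None, :]).astype(float)
    print(crc_feasible_lambdas(toy_losses, grid, alpha=0.1))
```

The correction term `loss_max / (n + 1)` shrinks as the calibration set grows, so more calibration data lets more values of $\lambda$ pass the check at the same $\alpha$.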

Paper Structure

This paper contains 20 sections, 10 theorems, 46 equations, 15 figures.

Key Result

Proposition 1

Under the counterfactual monotonicity assumption, for any $x, y, \hat{y} \sim P^{\mathcal{M}}$, the counterfactual harm that a decision support system $\mathcal{C}_{\lambda}$ would have caused, if deployed, is given by

Figures (15)

  • Figure 1: Our structural causal model $\mathcal{M}$. Circles represent endogenous random variables and boxes represent exogenous random variables. The value of each endogenous variable is given by a function of the values of its ancestors, as defined by Eq. \ref{eq:scm}. The value of each exogenous variable is sampled independently from a given distribution.
  • Figure 2: Average accuracy estimated by the mixture of MNLs against the average counterfactual harm for images with $\omega = 110$. Each point corresponds to a $\lambda$ value from $0$ to $1$ with step $0.001$ and the coloring indicates the relative frequency with which each $\lambda$ value is in $\Lambda(\alpha)$ across random samplings of the calibration set. Each row corresponds to decision support systems $\mathcal{C}_{\lambda}$ with a different pre-trained classifier with average accuracies $0.846$ (VGG19), $0.830$ (DenseNet), $0.722$ (GoogleNet), $0.727$ (ResNet152), and $0.691$ (AlexNet). The average accuracy achieved by the simulated human experts on their own is $0.771$. The results are averaged across $50$ random samplings of the test and calibration set. In both panels, $95\%$ confidence intervals are represented using shaded areas and always have width below $0.02$.
  • Figure 3: Average accuracy estimated using predictions by human participants (Real) and using the mixture of MNLs (Predicted) against the average counterfactual harm for images with $\omega=110$. Each point corresponds to a $\lambda$ value from $0$ to $1$ with step $0.001$ and the coloring indicates the relative frequency with which the $\lambda$ value is in $\Lambda(\alpha)$ across random samplings of the calibration set. The decision support systems $\mathcal{C}_{\lambda}$ use the pre-trained classifier VGG19. The results are averaged across $50$ random samplings of the test and calibration set. In both panels, $95\%$ confidence intervals are represented using shaded areas and always have width below $0.02$.
  • Figure 4: Average accuracy estimated by the mixture of MNLs against the average counterfactual harm for images with $\omega = 80$. Each point corresponds to a $\lambda$ value from $0$ to $1$ with step $0.001$ and the coloring indicates the relative frequency with which each $\lambda$ value is in $\Lambda(\alpha)$ across random samplings of the calibration set. Each row corresponds to decision support systems $\mathcal{C}_{\lambda}$ with a different pre-trained classifier with average accuracies $0.891$ (VGG19), $0.892$ (DenseNet161), $0.802$ (GoogleNet), $0.804$ (ResNet152), and $0.784$ (AlexNet). The average accuracy achieved by human experts on their own is $0.9$. The results are averaged across $50$ random samplings of the test and calibration set. In both panels, $95\%$ confidence intervals have width always below $0.02$ and are represented using shaded areas.
  • Figure 5: Average accuracy estimated by the mixture of MNLs against the average counterfactual harm for images with $\omega = 95$. Each point corresponds to a $\lambda$ value from $0$ to $1$ with step $0.001$ and the coloring indicates the relative frequency with which each $\lambda$ value is in $\Lambda(\alpha)$ across random samplings of the calibration set. Each row corresponds to decision support systems $\mathcal{C}_{\lambda}$ with a different pre-trained classifier with average accuracies $0.88$ (VGG19), $0.868$ (DenseNet161), $0.775$ (GoogleNet), $0.773$ (ResNet152), and $0.745$ (AlexNet). The average accuracy achieved by human experts on their own is $0.86$. The results are averaged across $50$ random samplings of the test and calibration set. In both panels, $95\%$ confidence intervals have width always below $0.02$ and are represented using shaded areas.
  • ...and 10 more figures

Theorems & Definitions (11)

  • Definition 1: Counterfactual Harm
  • Proposition 1
  • Proposition 2
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Corollary 2
  • Lemma 1
  • Lemma 2
  • Lemma 3
  • ...and 1 more