Using predictive multiplicity to measure individual performance within the AI Act

Karolin Frohnapfel, Mara Seyfert, Sebastian Bordt, Ulrike von Luxburg, Kristof Meding

TL;DR

The paper links predictive multiplicity to the EU AI Act's accuracy and transparency requirements, arguing that relying on a single best model can misrepresent individual performance. It introduces two practical measures, the conflict ratio and $\delta$-ambiguity, together with an ad-hoc Rashomon-set construction, to quantify and report disagreement across models with comparable accuracy $\mathrm{acc}(g)$. Through experiments on synthetic and ACS data, it shows that incorporating dataset multiplicity is essential to reflect individual-level performance and to guide deployers toward human oversight when predictions are highly conflicting. The authors advocate making multiplicity information available to deployers to support compliant, trustworthy high-risk AI systems.

Abstract

When building AI systems for decision support, one often encounters the phenomenon of predictive multiplicity: a single best model does not exist; instead, one can construct many models with similar overall accuracy that differ in their predictions for individual cases. Especially when decisions have a direct impact on humans, this can be highly unsatisfactory. For a person subject to high disagreement between models, one could just as well have chosen a different model of similar overall accuracy that would have decided the person's case differently. We argue that this arbitrariness conflicts with the EU AI Act, which requires providers of high-risk AI systems to report performance not only at the dataset level but also for specific persons. The goal of this paper is to put predictive multiplicity in context with the EU AI Act's provisions on accuracy and to subsequently derive concrete suggestions on how to evaluate and report predictive multiplicity in practice. Specifically: (1) We argue that incorporating information about predictive multiplicity can serve compliance with the EU AI Act's accuracy provisions for providers. (2) Based on this legal analysis, we suggest individual conflict ratios and $\delta$-ambiguity as tools to quantify the disagreement between models on individual cases and to help detect individuals subject to conflicting predictions. (3) Based on computational insights, we derive easy-to-implement rules on how model providers could evaluate predictive multiplicity in practice. (4) Ultimately, we suggest that information about predictive multiplicity should be made available to deployers under the AI Act, enabling them to judge whether system outputs for specific individuals are reliable enough for their use case.

Paper Structure (34 sections, 5 equations, 7 figures, 1 table)

Figures (7)

  • Figure 1: Our work in a nutshell. Predictive multiplicity means that classifiers with the same statistical accuracy may decide a large number of individual cases differently (left side). We argue that this challenges the AI Act requirements on accuracy and transparency (right side). To resolve this issue, we recommend that providers and deployers use methods to identify conflicting cases and involve human oversight in the decision-making process.
  • Figure 2: The importance of dataset multiplicity, illustrated on two-dimensional synthetic data. For fixed parametric multiplicity, we compare different ways to incorporate dataset multiplicity into the ad-hoc approach, from (a) no dataset multiplicity to (d) full dataset multiplicity. The ability to discover conflicting data points in the overlap region $\mathcal{N}((5, 5), (1, 1))$ increases with stronger integration of dataset multiplicity (scatter plots; left to right). The $\delta$-ambiguity curves (right plot) depict the extent of the improvement for different conflict ratio thresholds $\delta$. The ground truth would be a constant $0.5$ curve.
  • Figure 3: The importance of dataset multiplicity on ACSEmployment data. For fixed parametric multiplicity, we compare different ways to incorporate dataset multiplicity into the ad-hoc approach, from (a) no dataset multiplicity to (f) full dataset multiplicity. The integration of dataset multiplicity increases the ability to discover conflicting data points (left plot) and highly conflicting data points (middle plot). The effectiveness differs between approaches. All approaches that include dataset multiplicity produce similar individual conflict ratios with small pairwise distances (right plot).
  • Figure 4: Comparison of the ad-hoc approach and TreeFarms (Xin et al., 2022) on COMPAS data. With the ad-hoc approach, we train Rashomon sets of different sizes and track the $\delta$-ambiguities. Results are compared to the ground truth Rashomon set found by TreeFarms. The dataset has $12$ features, and models are trained on a training set of size $4{,}144$ and evaluated on a test set of size $2{,}763$.
  • Figure 5: Best accuracy as a function of the size of the data subset.
  • ...and 2 more figures

Theorems & Definitions (3)

  • Definition 4.1: Rashomon set
  • Definition 4.2: Conflict ratio
  • Definition 4.3: $\delta$-Ambiguity
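
The three definitions above are listed by name only, so the following Python sketch is one plausible reading rather than the authors' formulation. It assumes that the Rashomon set contains all models whose accuracy lies within a tolerance `epsilon` of the best model, that the conflict ratio of a data point is the share of model pairs that predict different labels for it, and that $\delta$-ambiguity is the share of data points whose conflict ratio is at least $\delta$. The function names, the parameter `epsilon`, the choice of `>=` for the threshold, and the toy numbers are all hypothetical.

```python
import numpy as np


def rashomon_indices(accuracies, epsilon):
    """Indices of models within `epsilon` of the best accuracy (assumed definition)."""
    accuracies = np.asarray(accuracies, dtype=float)
    return np.flatnonzero(accuracies >= accuracies.max() - epsilon)


def conflict_ratios(predictions):
    """Per-point share of disagreeing model pairs (assumed definition).

    predictions: array of shape (n_models, n_points) holding class labels.
    """
    predictions = np.asarray(predictions)
    n_models, n_points = predictions.shape
    if n_models < 2:
        raise ValueError("need at least two models to form a pair")
    total_pairs = n_models * (n_models - 1) / 2
    ratios = np.empty(n_points)
    for j in range(n_points):
        # Pairs of models that agree are pairs that fall into the same class.
        _, counts = np.unique(predictions[:, j], return_counts=True)
        agreeing_pairs = np.sum(counts * (counts - 1) / 2)
        ratios[j] = 1.0 - agreeing_pairs / total_pairs
    return ratios


def delta_ambiguity(ratios, delta):
    """Share of points whose conflict ratio is at least `delta` (assumed definition)."""
    return float(np.mean(np.asarray(ratios) >= delta))


# Toy example: four models, three data points (made-up numbers).
accuracies = [0.80, 0.79, 0.785, 0.60]
predictions = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [1, 0, 1],
    [0, 1, 1],
])

selected = rashomon_indices(accuracies, epsilon=0.02)  # drops the weakest model
ratios = conflict_ratios(predictions[selected])
ambiguity_at_half = delta_ambiguity(ratios, delta=0.5)
```

In this toy setup the three retained models agree on the first point and split two to one on the other two, so a deployer looking at `ratios` would see which individual predictions depend on the arbitrary choice of model. Evaluating `delta_ambiguity` over a grid of thresholds gives a curve of the kind described in the Figure 2 caption.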