Perceptions of the Fairness Impacts of Multiplicity in Machine Learning

Anna P. Meyer, Yea-Seul Kim, Aws Albarghouthi, Loris D'Antoni

TL;DR

This paper investigates whether lay stakeholders perceive multiplicity in ML as a fairness risk and how they prefer to resolve it. Using a two-part survey with an educational component and a conjoint-style analysis, it finds that multiplicity does not broadly erode perceived fairness, though participants dislike both ignoring multiplicity and resolving it by randomization, and they show a clear preference for human-in-the-loop or more sophisticated resolution methods. Preferences vary with task stakes and framing, suggesting that practical ML deployment should tailor multiplicity-handling mechanisms to each context. The study highlights a gap between philosophical arguments for randomization and lay expectations, and it calls for greater transparency and stakeholder-aligned design in ML systems exhibiting multiplicity.

Abstract

Machine learning (ML) is increasingly used in high-stakes settings, yet multiplicity - the existence of multiple good models - means that some predictions are essentially arbitrary. ML researchers and philosophers posit that multiplicity poses a fairness risk, but no studies have investigated whether stakeholders agree. In this work, we conduct a survey to see how multiplicity impacts lay stakeholders' - i.e., decision subjects' - perceptions of ML fairness, and which approaches to address multiplicity they prefer. We investigate how these perceptions are modulated by task characteristics (e.g., stakes and uncertainty). Survey respondents think that multiplicity threatens the fairness of model outcomes, but not the appropriateness of using the model, even though existing work suggests the opposite. Participants are strongly against resolving multiplicity by using a single model (effectively ignoring multiplicity) or by randomizing the outcomes. Our results indicate that model developers should be intentional about dealing with multiplicity in order to maintain fairness.
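
To make the notion of multiplicity concrete, here is a minimal sketch. It is not taken from the paper: the one-dimensional dataset, the threshold classifiers, and every name in it are illustrative assumptions. It shows two equally accurate models that disagree on individual predictions, so the outcome for those individuals depends on which model happens to be deployed.

```python
# Hypothetical toy data: one feature value and one binary label per point.
xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 1, 0, 1, 1]


def threshold_model(t):
    """Return a classifier that predicts 1 when x > t (illustrative model class)."""
    return lambda x: int(x > t)


def accuracy(model):
    """Fraction of points in the toy dataset that the model labels correctly."""
    return sum(model(x) == y for x, y in zip(xs, ys)) / len(xs)


# Two candidate models from the same model class.
model_a = threshold_model(2.5)
model_b = threshold_model(4.5)

acc_a = accuracy(model_a)
acc_b = accuracy(model_b)

# Best accuracy any threshold can reach on this data (below 1.0 here).
best_acc = max(accuracy(threshold_model(t + 0.5)) for t in range(0, 7))

# Points whose prediction depends on which of the two models is chosen.
disputed = [x for x in xs if model_a(x) != model_b(x)]

print(f"accuracy A={acc_a:.3f}, B={acc_b:.3f}, best possible={best_acc:.3f}")
print("points with conflicting predictions:", disputed)
```

In this sketch both models make exactly one error and no threshold does better, yet they assign different outcomes to the points at x = 3 and x = 4.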

Paper Structure (90 sections, 5 figures, 8 tables)

Figures (5)

  • Figure 1: A: Example dataset with two classes (red circle and blue triangle) that, when limited to linear models, exhibits predictive multiplicity because no model achieves perfect accuracy while multiple models are correct on all but one prediction. Points in the yellow shaded region receive a different prediction depending on whether we choose the "original" or "new" model. B: Example of model multiplicity based on model predictions. C: For the task of detecting fraudulent reviews, the overlap between reviews flagged as fraudulent by humans and by two models. Both models overlap significantly with the human predictions, but less with each other.
  • Figure 2: Overview of our survey design, including the preliminary study (top row) and main study (bottom row).
  • Figure 3: Marginal means for each multiplicity resolution technique. A score of 0.5 is the frequency with which an option would be chosen if selection were purely random, while a score greater (less) than 0.5 indicates that an option is chosen more (less) frequently than would be expected by random chance.
  • Figure 4: Image that we include with the multiplicity education.
  • Figure 5: Marginal probabilities stratified by task stakes, task framing, and task uncertainty. For each "feature" (multiplicity resolution technique or cost level), the marginal probability is the chance of choosing that attribute, relative to a baseline random choice of 0.5. Results are further broken down by task characteristic. Results are drawn darker when the confidence intervals for the two task levels do not overlap, as a marker of greater significance. Confidence intervals were computed using $\alpha=0.05$ to ease readability; see the discussion in \ref{sec:res_hypo} for which differences are statistically significant with a multiple hypothesis correction.
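
The marginal means reported in Figures 3 and 5 can be read as choice frequencies. The sketch below is a simplified illustration, not the authors' analysis pipeline: it assumes a paired forced-choice design, and the option labels `A`, `B`, `C` and the choice records are made-up placeholders. For each option it computes the share of appearances in which that option was chosen, so 0.5 corresponds to purely random selection.

```python
from collections import defaultdict

# Hypothetical paired-choice tasks: (first option, second option, option chosen).
# The labels are placeholders, not the resolution techniques from the survey.
tasks = [
    ("A", "B", "A"),
    ("A", "C", "A"),
    ("B", "C", "B"),
    ("C", "A", "C"),
]


def marginal_means(choice_tasks):
    """Share of appearances in which each option was the one chosen."""
    shown = defaultdict(int)
    chosen = defaultdict(int)
    for first, second, picked in choice_tasks:
        shown[first] += 1
        shown[second] += 1
        chosen[picked] += 1
    return {option: chosen[option] / shown[option] for option in shown}


mm = marginal_means(tasks)
for option in sorted(mm):
    print(f"{option}: {mm[option]:.3f}")
```

With these placeholder records, option `A` lands above 0.5, option `B` exactly at 0.5, and option `C` below it. A real analysis would also need confidence intervals and, as the Figure 5 caption notes, a multiple hypothesis correction.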