Ecosystem-level Analysis of Deployed Machine Learning Reveals Homogeneous Outcomes

Connor Toups, Rishi Bommasani, Kathleen A. Creel, Sarah H. Bana, Dan Jurafsky, Percy Liang

TL;DR

The paper introduces ecosystem-level analysis to study the societal impact of deployed machine learning beyond model-level metrics, focusing on how multiple decision-makers interact to produce outcomes for individuals. It defines the failure matrix and systemic failures, and develops a baseline comparison using a Poisson-Binomial model to reveal homogenization across real deployments. Empirical findings across HAPI and DDI show that homogeneous outcomes are pervasive and largely persist even as individual models improve, with notable differences between models and humans in dermatology. The work highlights data- and model-centric explanations, discusses policy implications, and advocates for ecosystem-level monitoring and mitigations to align ML deployments with the public interest.

Abstract

Machine learning is traditionally studied at the model level: researchers measure and improve the accuracy, robustness, bias, efficiency, and other dimensions of specific models. In practice, the societal impact of machine learning is determined by the surrounding context of machine learning deployments. To capture this, we introduce ecosystem-level analysis: rather than analyzing a single model, we consider the collection of models that are deployed in a given context. For example, ecosystem-level analysis in hiring recognizes that a job candidate's outcomes are not only determined by a single hiring algorithm or firm but instead by the collective decisions of all the firms they applied to. Across three modalities (text, images, speech) and 11 datasets, we establish a clear trend: deployed machine learning is prone to systemic failure, meaning some users are exclusively misclassified by all models available. Even when individual models improve at the population level over time, we find these improvements rarely reduce the prevalence of systemic failure. Instead, the benefits of these improvements predominantly accrue to individuals who are already correctly classified by other models. In light of these trends, we consider medical imaging for dermatology where the costs of systemic failure are especially high. While traditional analyses reveal racial performance disparities for both models and humans, ecosystem-level analysis reveals new forms of racial disparity in model predictions that do not present in human predictions. These examples demonstrate ecosystem-level analysis has unique strengths for characterizing the societal impact of machine learning.
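
The failure matrix and the Poisson-Binomial baseline named in the TL;DR can be illustrated with a small sketch. It assumes the failure matrix is a 0/1 array with one row per individual and one column per model (1 = failure), and that the baseline treats models as failing independently at their observed marginal failure rates. The function names and the toy data are hypothetical and are not taken from the paper's code.

```python
import numpy as np

def poisson_binomial_pmf(fail_probs):
    """Return pmf where pmf[t] = P(exactly t models fail), assuming
    the models fail independently with the given marginal rates."""
    pmf = np.array([1.0])
    for p in fail_probs:
        # Convolve with the Bernoulli(p) distribution of one more model.
        pmf = np.convolve(pmf, [1.0 - p, p])
    return pmf

def observed_failure_distribution(F):
    """Empirical distribution of per-individual failure counts.

    F is an (n_individuals, n_models) 0/1 array, 1 = failure.
    """
    F = np.asarray(F)
    counts = F.sum(axis=1)
    return np.bincount(counts, minlength=F.shape[1] + 1) / F.shape[0]

# Toy failure matrix (hypothetical data): 8 individuals, 3 models.
F = np.array([
    [1, 1, 1],
    [1, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [0, 0, 0],
    [1, 0, 0],
    [0, 1, 0],
])

observed = observed_failure_distribution(F)
baseline = poisson_binomial_pmf(F.mean(axis=0))

# Systemic failure = every model fails on the individual (last entry).
systemic_observed = observed[-1]
systemic_baseline = baseline[-1]
print("observed:", observed)
print("baseline:", baseline)
```

In this toy ecosystem the observed rates of "all models fail" and "all models succeed" both exceed the independence baseline, which is the pattern the paper calls homogeneous outcomes.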

Paper Structure (44 sections, 2 equations, 16 figures, 1 table)

Figures (16)

  • Figure 1: Ecosystem-level analysis. Individuals interact with decision-makers (left), receiving outcomes that constitute the failure matrix (right).
  • Figure 2: Homogeneous outcomes. Ecosystem-level analysis surfaces the general trend of homogeneous outcomes: the observed rates at which all models succeed or all models fail consistently exceed the corresponding baseline rates. The DIGIT panel shows that models in the DIGIT dataset are more likely to all fail or all succeed on an instance than the baseline predicts. The all-datasets panel shows that, across all datasets, systemic failure (red dots) and consistent success (blue dots) of all three models on an instance are both more common than the baseline, whereas intermediate outcomes are less common than the baseline.
  • Figure 3: Examples of homogeneous outcomes. Instances are sampled uniformly at random from "0 models correct" (top row) or "3 models correct" (bottom row) in fer+. The systemic failures (top row) do not appear to be inherently harder for humans to classify; more extensive analysis appears in the supplement.
  • Figure 4: Model improvement is not concentrated on systemic failures. When a model improves, we compare the distribution of outcome profiles of the other two models on its initial failures (potential improvements) to the distribution on the instances it improved on (observed improvements). Across all improvements, including Amazon's improvement on waimai (left), there is a clear over-improvement on [✓, ✓] (above the line $y = x$ on the right) and under-improvement on [X, X] (below the line $y = x$ on the right).
  • Figure 5: Homogeneous outcomes for models and humans. Consistent with HAPI, model predictions (left) yield homogeneous outcomes on DDI. Human predictions (right) are even more homogeneous than model predictions.
  • ...and 11 more figures
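
The comparison described in the Figure 4 caption can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes two 0/1 failure matrices (1 = failure) for the same instances at two time points, and it reads the other models' outcome profiles from the earlier matrix. The function name `improvement_profiles` and the toy data are hypothetical.

```python
from collections import Counter
import numpy as np

def improvement_profiles(before, after, target):
    """Distributions of the other models' outcome profiles on
    (a) the target model's initial failures (potential improvements) and
    (b) the instances it fixed (observed improvements).

    Profiles are tuples of 0/1 failures, e.g. (0, 0) is [✓, ✓] and
    (1, 1) is [X, X]. Assumption: profiles come from `before`.
    """
    before = np.asarray(before)
    after = np.asarray(after)
    others = np.delete(before, target, axis=1)
    potential = before[:, target] == 1
    improved = potential & (after[:, target] == 0)

    def dist(mask):
        profiles = Counter(tuple(int(v) for v in row) for row in others[mask])
        total = sum(profiles.values())
        return {p: c / total for p, c in profiles.items()} if total else {}

    return dist(potential), dist(improved)

# Toy data (hypothetical): model 0 improves between the two snapshots.
before = np.array([
    [1, 0, 0],
    [1, 0, 0],
    [1, 1, 1],
    [1, 1, 1],
    [1, 0, 1],
    [0, 0, 0],
])
after = np.array([
    [0, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
])

potential_dist, observed_dist = improvement_profiles(before, after, target=0)
for profile in sorted(potential_dist):
    print(profile, potential_dist[profile], observed_dist.get(profile, 0.0))
```

In the toy data, the [✓, ✓] profile makes up a larger share of observed improvements than of potential improvements, while [X, X] makes up a smaller share, mirroring the over- and under-improvement pattern the caption reports.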