Predictive Churn with the Set of Good Models

Jamelle Watson-Daniels, Flavio du Pin Calmon, Alexander D'Amour, Carol Long, David C. Parkes, Berk Ustun

TL;DR

This paper addresses predictive inconsistency by linking two concepts: predictive multiplicity, where near-optimal models disagree on individual predictions, and predictive churn, where predictions change after model updates. It builds a theoretical framework around Rashomon sets, $\beta$-stability, and churn bounds, deriving $\mathbb{E}[C_{\gamma}(h'_A,h'_B)] \leq \frac{\beta \sqrt{\pi n}}{\gamma} + 2\epsilon$ and $C(h_0,h') \leq 2 \hat{R}(h_0) + \epsilon$, then extends these ideas to empirical Rashomon sets produced by randomized training. The authors provide empirical evidence across four datasets showing that reducing predictive multiplicity via uncertainty-aware models or ensembles can also reduce predictive churn, and that unstable instances identified via multiplicity often overlap with churn-unstable instances. Practically, the work suggests a workflow where multiplicity analysis informs churn risk and deployment decisions, promoting integrated strategies that improve reliability and accountability in machine learning systems.

Abstract

Issues can arise when research focused on fairness, transparency, or safety is conducted separately from research driven by practical deployment concerns and vice versa. This separation creates a growing need for translational work that bridges the gap between independently studied concepts that may be fundamentally related. This paper explores connections between two seemingly unrelated concepts of predictive inconsistency that share intriguing parallels. The first, known as predictive multiplicity, occurs when models that perform similarly (e.g., nearly equivalent training loss) produce conflicting predictions for individual samples. This concept is often emphasized in algorithmic fairness research as a means of promoting transparency in ML model development. The second concept, predictive churn, examines the differences in individual predictions before and after model updates, a key challenge in deploying ML models in consumer-facing applications. We present theoretical and empirical results that uncover links between these previously disconnected concepts.
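
To make the second notion concrete, here is a minimal sketch of how churn between a deployed model and its update might be measured. It assumes the common formulation of churn as the fraction of examples whose predicted label differs between the two models; the function name `churn` and the prediction arrays are hypothetical and not taken from the paper.

```python
import numpy as np


def churn(preds_a, preds_b):
    """Fraction of examples on which two models' predicted labels differ."""
    preds_a = np.asarray(preds_a)
    preds_b = np.asarray(preds_b)
    if preds_a.shape != preds_b.shape:
        raise ValueError("prediction arrays must have the same shape")
    return float(np.mean(preds_a != preds_b))


# Hypothetical predicted labels from a model before and after an update.
before = np.array([1, 0, 1, 1, 0, 0, 1, 0])
after = np.array([1, 0, 0, 1, 0, 1, 1, 0])

print(churn(before, after))  # 2 of 8 predictions flip -> 0.25
```

Churn defined this way does not depend on the true labels: two models can have identical accuracy and still disagree on many individual examples, which is the same phenomenon that predictive multiplicity describes for near-optimal models.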

Paper Structure

This paper contains 41 sections, 5 theorems, 30 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Theorem 5.2

Assume a training algorithm that is $\beta$-stable. Given two $\epsilon$-Rashomon sets defined with respect to the baseline models, $\mathcal{R}_\epsilon(h_0^A)$ and $\mathcal{R}_\epsilon(h_0^B)$, the smooth churn between any pair of models within the two $\epsilon$-Rashomon sets: $h'_A \in$ ...
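
The expected smooth-churn bound quoted in the TL;DR for this setting, $\frac{\beta \sqrt{\pi n}}{\gamma} + 2\epsilon$, can be evaluated numerically to see how its terms trade off. The sketch below only computes that right-hand side. Reading $n$ as the number of training examples and $\gamma$ as the smooth-churn parameter is an assumption, and the helper name and all numeric values are hypothetical.

```python
import math


def expected_smooth_churn_bound(beta, n, gamma, epsilon):
    """Evaluate beta * sqrt(pi * n) / gamma + 2 * epsilon.

    beta    -- stability parameter of the training algorithm
    n       -- assumed here to be the number of training examples
    gamma   -- assumed here to be the smooth-churn parameter (must be > 0)
    epsilon -- Rashomon parameter
    """
    if n <= 0 or gamma <= 0:
        raise ValueError("n and gamma must be positive")
    return beta * math.sqrt(math.pi * n) / gamma + 2.0 * epsilon


# Hypothetical values, chosen only to illustrate the scaling.
n = 10_000
for beta in (1.0 / n, 10.0 / n):
    bound = expected_smooth_churn_bound(beta, n, gamma=0.5, epsilon=0.01)
    print(f"beta={beta:.4f}  bound={bound:.4f}")
```

Because the first term grows like $\beta\sqrt{n}$, the bound tightens with more data only when $\beta$ shrinks faster than $1/\sqrt{n}$. The $2\epsilon$ term is a floor set by how loosely the two Rashomon sets are defined.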

Figures (4)

  • Figure 1: Predicted probability distributions for the Adult dataset. We plot a histogram of the predicted probabilities in grey against the left $y$-axis and a scatter plot of the proportion of flips in each bin against the right $y$-axis. Overlaying the two plots shows both the model's confidence in its predictions (histogram) and the regions where predictions are most prone to change (scatter plot of flips). Note that the histogram and the flip proportions are on different scales. The top row corresponds to the DNN experiments and the bottom row to the UA-DNN experiments. Each column represents an experiment: from left to right, predictive multiplicity, large dataset update, and small dataset update.
  • Figure 2: Predicted probability distributions for the Credit dataset.
  • Figure 3: Predicted probability distributions for the HMDA dataset.
  • Figure 4: Pearson correlation between features, predicted probabilities ($p$), ambiguity indicator, and churn indicator. Top left is Adult, top right is Mammo, bottom left is HMDA, bottom right is Credit. Results are shown for the DNN model.

Theorems & Definitions (20)

  • Definition 3.1: Predictive churn [LaunchAndIterate]
  • Definition 3.2: Churn Unstable Set
  • Definition 3.3: $\epsilon$-Rashomon Set w.r.t. $h_0$
  • Definition 3.4: Empirical $\epsilon$-Rashomon set
  • Definition 3.5: Empirical $\epsilon$-Ambiguity
  • Definition 4.1: Ensemble Classifier [long2023arbitrariness]
  • Definition 5.1: $\beta$-stability [LaunchAndIterate]
  • Theorem 5.2: Expected Churn between Rashomon Sets
  • Lemma 5.3: Bound on Churn
  • Corollary 5.4: Bound on Churn within $\mathcal{R}_\epsilon$
  • ...and 10 more
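
Several of the definitions listed above can be illustrated together. The sketch below builds an empirical $\epsilon$-Rashomon set from a pool of candidate prediction vectors, computes an ambiguity score, and checks the bound $C(h_0,h') \leq 2 \hat{R}(h_0) + \epsilon$ quoted in the TL;DR for each member. It assumes 0-1 loss and treats ambiguity as the fraction of examples on which at least one member disagrees with the baseline, the usual formulation in the predictive multiplicity literature. Whether this matches Definitions 3.4 and 3.5 exactly is an assumption. All function names are hypothetical, and `flip_labels` is a synthetic stand-in for models trained with different random seeds.

```python
import numpy as np


def empirical_risk(preds, y):
    """Empirical 0-1 risk: fraction of misclassified examples."""
    return float(np.mean(np.asarray(preds) != np.asarray(y)))


def empirical_rashomon_set(candidate_preds, baseline_preds, y, epsilon):
    """Indices of candidates whose risk is within epsilon of the baseline's."""
    threshold = empirical_risk(baseline_preds, y) + epsilon
    return [i for i, preds in enumerate(candidate_preds)
            if empirical_risk(preds, y) <= threshold]


def empirical_ambiguity(candidate_preds, baseline_preds, members):
    """Fraction of examples where some member disagrees with the baseline."""
    if len(members) == 0:
        return 0.0
    member_preds = np.asarray(candidate_preds)[members]
    flipped = np.any(member_preds != np.asarray(baseline_preds), axis=0)
    return float(np.mean(flipped))


def flip_labels(y, rate, rng):
    """Synthetic stand-in for a model: y with a random fraction of labels flipped."""
    mask = rng.random(y.shape[0]) < rate
    return np.where(mask, 1 - y, y)


rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=2000)            # synthetic binary labels
baseline = flip_labels(y, 0.10, rng)          # baseline model h_0
candidates = np.stack(
    [flip_labels(y, rate, rng) for rate in (0.08, 0.10, 0.12, 0.30)]
)

epsilon = 0.05
members = empirical_rashomon_set(candidates, baseline, y, epsilon)
print("Rashomon set members:", members)
print("ambiguity:", empirical_ambiguity(candidates, baseline, members))

# Each member's churn against the baseline should respect 2 * R(h_0) + epsilon.
limit = 2 * empirical_risk(baseline, y) + epsilon
for i in members:
    churn_i = float(np.mean(candidates[i] != baseline))
    print(f"model {i}: churn={churn_i:.3f}  within bound: {churn_i <= limit}")
```

The bound holds for any member because two classifiers can only disagree on an example where at least one of them is wrong, so churn is at most the sum of their risks, and a member's risk is at most $\hat{R}(h_0) + \epsilon$.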