Predictive Churn with the Set of Good Models

Jamelle Watson-Daniels; Flavio du Pin Calmon; Alexander D'Amour; Carol Long; David C. Parkes; Berk Ustun

TL;DR

This paper addresses predictive inconsistency by linking two concepts: predictive multiplicity, where near-optimal models disagree on individual predictions, and predictive churn, where predictions change after model updates. It builds a theoretical framework around Rashomon sets, $\beta$-stability, and churn bounds, deriving $\mathbb{E}[C_{\gamma}(h'_A,h'_B)] \leq \frac{\beta \sqrt{\pi n}}{\gamma} + 2\epsilon$ and $C(h_0,h') \leq 2 \hat{R}(h_0) + \epsilon$, then extends these ideas to empirical Rashomon sets produced by randomized training. The authors provide empirical evidence across four datasets showing that reducing predictive multiplicity via uncertainty-aware models or ensembles can also reduce predictive churn, and that unstable instances identified via multiplicity often overlap with churn-unstable instances. Practically, the work suggests a workflow where multiplicity analysis informs churn risk and deployment decisions, promoting integrated strategies that improve reliability and accountability in machine learning systems.
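To get a feel for the scale of the two bounds quoted above, the following minimal sketch evaluates their right-hand sides. The summary does not define the symbols, so reading $n$ as the number of training samples, $\gamma$ as the smoothing parameter of the smooth churn, and $\hat{R}(h_0)$ as the empirical risk of the baseline is an assumption, and the function names and input values are hypothetical, chosen only for illustration.

```python
import math


def expected_smooth_churn_bound(beta: float, n: int, gamma: float, epsilon: float) -> float:
    """Right-hand side of the first bound: beta * sqrt(pi * n) / gamma + 2 * epsilon."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return beta * math.sqrt(math.pi * n) / gamma + 2 * epsilon


def churn_bound_from_baseline(baseline_risk: float, epsilon: float) -> float:
    """Right-hand side of the second bound: 2 * R_hat(h_0) + epsilon."""
    return 2 * baseline_risk + epsilon


# Illustrative, made-up values (not taken from the paper).
bound_a = expected_smooth_churn_bound(beta=1e-4, n=10_000, gamma=0.5, epsilon=0.01)
bound_b = churn_bound_from_baseline(baseline_risk=0.10, epsilon=0.01)
print(f"expected smooth churn bound: {bound_a:.4f}")
print(f"churn bound from baseline risk: {bound_b:.4f}")
```

With these inputs the first bound is driven mostly by the ratio $\beta\sqrt{n}/\gamma$, which shows why it is informative only when the stability constant shrinks quickly enough relative to $\sqrt{n}$.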

Abstract

Issues can arise when research focused on fairness, transparency, or safety is conducted separately from research driven by practical deployment concerns and vice versa. This separation creates a growing need for translational work that bridges the gap between independently studied concepts that may be fundamentally related. This paper explores connections between two seemingly unrelated concepts of predictive inconsistency that share intriguing parallels. The first, known as predictive multiplicity, occurs when models that perform similarly (e.g., nearly equivalent training loss) produce conflicting predictions for individual samples. This concept is often emphasized in algorithmic fairness research as a means of promoting transparency in ML model development. The second concept, predictive churn, examines the differences in individual predictions before and after model updates, a key challenge in deploying ML models in consumer-facing applications. We present theoretical and empirical results that uncover links between these previously disconnected concepts.
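The two notions of inconsistency described above can be sketched on hard label predictions. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and it assumes the competing models passed to `ambiguity` have already been filtered to those that perform nearly as well as the baseline.

```python
from typing import Sequence


def churn(preds_before: Sequence[int], preds_after: Sequence[int]) -> float:
    """Fraction of samples whose predicted label changes after a model update."""
    if len(preds_before) != len(preds_after) or not preds_before:
        raise ValueError("prediction lists must be non-empty and of equal length")
    flips = sum(a != b for a, b in zip(preds_before, preds_after))
    return flips / len(preds_before)


def ambiguity(baseline_preds: Sequence[int],
              competing_preds: Sequence[Sequence[int]]) -> float:
    """Fraction of samples on which at least one competing model
    disagrees with the baseline model's prediction."""
    n = len(baseline_preds)
    if n == 0:
        raise ValueError("baseline predictions must be non-empty")
    ambiguous = sum(
        any(model[i] != baseline_preds[i] for model in competing_preds)
        for i in range(n)
    )
    return ambiguous / n


# Toy labels for five samples (made up for illustration).
baseline = [1, 0, 1, 1, 0]
updated = [1, 1, 1, 0, 0]                      # model after an update
near_optimal = [[1, 0, 1, 1, 1], [1, 0, 0, 1, 0]]  # assumed near-optimal models

churn_value = churn(baseline, updated)
ambiguity_value = ambiguity(baseline, near_optimal)
print(churn_value, ambiguity_value)
```

Both quantities are per-sample disagreement rates. The difference lies in what the baseline is compared against: a single updated model for churn, and a whole set of similarly performing models for ambiguity.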

Paper Structure (41 sections, 5 theorems, 30 equations, 4 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Model Multiplicity
Predictive Churn
Uncertainty Quantification
Backward Compatibility
Underspecification and Reproducibility
Framework
Predictive Churn
Predictive Multiplicity
Multiplicity with respect to a baseline:
Multiplicity without a baseline:
Predictive Multiplicity Metric: Ambiguity
Methodology
Enhanced Uncertainty Quantification
...and 26 more sections

Key Result

Theorem 5.2

Assume a training algorithm that is $\beta$-stable. Given two $\epsilon$-Rashomon sets defined with respect to the baseline models, $\mathcal{R}_\epsilon(h_0^A)$ and $\mathcal{R}_\epsilon(h_0^B)$, the smooth churn between any pair of models within the two $\epsilon$-Rashomon sets: $h'_A$ ...

Figures (4)

Figure 1: Predicted probability distributions for the Adult Dataset. We plot a histogram of the predicted probability distribution in grey against the left $y$-axis and a scatter plot of the proportion of flip counts for each bin against the right $y$-axis. Overlaying the two plots shows both the model's confidence in its predictions (the histogram) and the regions where predictions are most prone to change (the scatter plot of flips). Note that the histogram and the flip counts use different scales. The top row corresponds to the DNN experiments and the bottom row to the UA-DNN experiments. Each column represents an experiment. From the left, we show results for predictive multiplicity, large dataset update, and small dataset update.
Figure 2: Predicted probability distributions for Credit Dataset.
Figure 3: Predicted probability distributions for HMDA Dataset.
Figure 4: Pearson correlation between features, predicted probabilities ($p$), ambiguity indicator, and churn indicator. Top left: Adult; top right: Mammo; bottom left: HMDA; bottom right: Credit. Results shown for the DNN model.
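The binned view described in the Figure 1 caption can be approximated with a short sketch. It assumes that "proportion of flip counts for each bin" means the share of samples within each probability bin whose prediction flips; the caption does not state the normalization, so that reading, the function name, and the bin count are all assumptions.

```python
from typing import List, Optional, Sequence, Tuple


def flip_proportion_by_bin(
    probs: Sequence[float],
    flipped: Sequence[bool],
    num_bins: int = 10,
) -> Tuple[List[int], List[Optional[float]]]:
    """Bin predicted probabilities into equal-width bins on [0, 1] and return
    (histogram counts, share of flipped samples per bin). Empty bins get None."""
    if len(probs) != len(flipped):
        raise ValueError("probs and flipped must have the same length")
    counts = [0] * num_bins
    flips = [0] * num_bins
    for p, f in zip(probs, flipped):
        # Clamp p == 1.0 into the last bin.
        idx = min(int(p * num_bins), num_bins - 1)
        counts[idx] += 1
        flips[idx] += int(f)
    proportions = [flips[i] / counts[i] if counts[i] else None for i in range(num_bins)]
    return counts, proportions


# Toy data (made up): flips concentrate near the 0.5 decision boundary.
toy_probs = [0.05, 0.45, 0.48, 0.55, 0.95]
toy_flipped = [False, True, False, True, False]
bin_counts, bin_flip_share = flip_proportion_by_bin(toy_probs, toy_flipped, num_bins=5)
print(bin_counts)
print(bin_flip_share)
```

Plotting `bin_counts` as a histogram on one axis and `bin_flip_share` as a scatter on a second axis would give the kind of overlay the caption describes.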

Theorems & Definitions (20)

Definition 3.1: Predictive churn [LaunchAndIterate]
Definition 3.2: Churn Unstable Set
Definition 3.3: $\epsilon$-Rashomon Set w.r.t. $h_0$
Definition 3.4: Empirical $\epsilon$-Rashomon set
Definition 3.5: Empirical $\epsilon$-Ambiguity
Definition 4.1: Ensemble Classifier [long2023arbitrariness]
Definition 5.1: $\beta$-stability [LaunchAndIterate]
Theorem 5.2: Expected Churn between Rashomon Sets
Lemma 5.3: Bound on Churn
Corollary 5.4: Bound on Churn within $\mathcal{R}_\epsilon$
...and 10 more
