Predictive Churn with the Set of Good Models
Jamelle Watson-Daniels, Flavio du Pin Calmon, Alexander D'Amour, Carol Long, David C. Parkes, Berk Ustun
TL;DR
This paper addresses predictive inconsistency by linking two concepts: predictive multiplicity, where near-optimal models disagree on individual predictions, and predictive churn, where predictions change after model updates. It builds a theoretical framework around Rashomon sets, $\beta$-stability, and churn bounds, deriving $\mathbb{E}[C_{\gamma}(h'_A,h'_B)] \leq \frac{\beta \sqrt{\pi n}}{\gamma} + 2\epsilon$ and $C(h_0,h') \leq 2\hat{R}(h_0) + \epsilon$, then extends these ideas to empirical Rashomon sets produced by randomized training. The authors provide empirical evidence across four datasets showing that reducing predictive multiplicity via uncertainty-aware models or ensembles can also reduce predictive churn, and that unstable instances identified via multiplicity often overlap with churn-unstable instances. Practically, the work suggests a workflow where multiplicity analysis informs churn risk and deployment decisions, promoting integrated strategies that improve reliability and accountability in machine learning systems.
Abstract
Issues can arise when research focused on fairness, transparency, or safety is conducted separately from research driven by practical deployment concerns and vice versa. This separation creates a growing need for translational work that bridges the gap between independently studied concepts that may be fundamentally related. This paper explores connections between two seemingly unrelated concepts of predictive inconsistency that share intriguing parallels. The first, known as predictive multiplicity, occurs when models that perform similarly (e.g., nearly equivalent training loss) produce conflicting predictions for individual samples. This concept is often emphasized in algorithmic fairness research as a means of promoting transparency in ML model development. The second concept, predictive churn, examines the differences in individual predictions before and after model updates, a key challenge in deploying ML models in consumer-facing applications. We present theoretical and empirical results that uncover links between these previously disconnected concepts.
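To make the two notions of inconsistency concrete, the following is a minimal sketch, not the authors' implementation. It assumes hard label predictions stored as NumPy arrays. The function names `churn` and `conflict_rate` and the toy predictions are illustrative assumptions that do not appear in the paper.

```python
import numpy as np


def churn(preds_before, preds_after):
    """Fraction of samples whose predicted label differs between two models.

    Illustrative helper: compares predictions before and after a model update.
    """
    before = np.asarray(preds_before)
    after = np.asarray(preds_after)
    if before.shape != after.shape:
        raise ValueError("prediction arrays must have the same shape")
    return float(np.mean(before != after))


def conflict_rate(baseline_preds, competing_preds):
    """Fraction of samples where at least one competing model contradicts the baseline.

    Illustrative multiplicity measure: `competing_preds` has shape
    (num_models, num_samples) and stands in for a set of similarly
    performing models.
    """
    baseline = np.asarray(baseline_preds)
    competing = np.asarray(competing_preds)
    if competing.ndim != 2 or competing.shape[1] != baseline.shape[0]:
        raise ValueError("competing_preds must have shape (num_models, num_samples)")
    # A sample is in conflict if any competing model disagrees with the baseline.
    in_conflict = (competing != baseline[None, :]).any(axis=0)
    return float(in_conflict.mean())


# Toy data (assumed for illustration only).
h0 = [1, 0, 1, 1, 0, 0]          # deployed model
h_updated = [1, 0, 0, 1, 0, 1]   # model after an update
near_optimal = [                 # hypothetical set of similarly performing models
    [1, 0, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 0, 0, 1, 0, 0],
]

print("churn:", churn(h0, h_updated))
print("conflict rate:", conflict_rate(h0, near_optimal))
```

Both quantities are per-sample disagreement rates; the difference is whether the comparison is against one updated model or against a whole set of similarly performing models.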
