Reconciling Predictive Multiplicity in Practice

Tina Behzad, Sílvia Casacuberta, Emily Ruth Diana, Alexander Williams Tolbert

TL;DR

This work tackles predictive multiplicity by studying the Reconcile algorithm, which iteratively patches two disagreeing predictors to reduce error and disagreement without restricting to a fixed hypothesis class. It delivers the first full implementation and extensive empirical evaluation on five fairness datasets, showing rapid convergence (1–7 rounds) and systematic reductions in model disagreement while improving Brier scores, often achieving near-complete agreement. The authors further extend Reconcile to causal inference via ReconcileCATE, obtaining similar guarantees for heterogeneous treatment effect estimators and validating them on causal benchmarks (Twins and National Study). They also analyze sequential Reconcile as a model-aggregation method, establishing robustness and fairness benefits, and relate Reconcile to multiaccuracy, illustrating a unified view of disagreement-witness patching. Overall, the work demonstrates that Reconcile is practically effective for reducing predictive multiplicity and can extend to causal settings, with significant implications for fairness, robustness, and decision-making under model uncertainty.

Abstract

Many machine learning applications predict individual probabilities, such as the likelihood that a person develops a particular illness. Because these probabilities are unknown, a key question is how to address situations in which different models trained on the same dataset produce varying predictions for certain individuals. This issue is exemplified by the model multiplicity (MM) phenomenon, where a set of comparable models yields inconsistent predictions. Roth, Tolbert, and Weinstein recently introduced a reconciliation procedure, the Reconcile algorithm, to address this problem. Given two disagreeing models, the algorithm leverages their disagreement to falsify and improve at least one of the models. In this paper, we empirically analyze the Reconcile algorithm using five widely used fairness datasets: COMPAS, Communities and Crime, Adult, Statlog (German Credit Data), and the ACS Dataset. We examine how Reconcile fits within the model multiplicity literature and compare it to existing MM solutions, demonstrating its effectiveness. We also discuss potential theoretical and practical improvements to the Reconcile algorithm. Finally, we extend the Reconcile algorithm to the setting of causal inference, where competing estimators can again disagree on specific conditional average treatment effect (CATE) values. We present the first extension of the Reconcile algorithm to causal inference, analyze its theoretical properties, and conduct empirical tests. Our results confirm the practical effectiveness of Reconcile and its applicability across various domains.
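
The patching idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: it works on in-sample prediction vectors rather than a distribution $\mathcal{D}$, omits any rounding or discretization of the patch value, and the names `brier` and `reconcile_sketch` are hypothetical. It assumes that in each round the larger one-sided disagreement region is selected and that the model whose mean prediction on that region is further from the observed mean outcome is shifted to match it.

```python
import numpy as np


def brier(pred, y):
    """Mean squared error between probability forecasts and binary outcomes."""
    return float(np.mean((pred - y) ** 2))


def reconcile_sketch(f1, f2, y, alpha=0.05, eps=0.1, max_rounds=10_000):
    """Patch two prediction vectors until they differ by more than eps
    on less than an alpha fraction of the sample (illustrative only)."""
    f1 = np.asarray(f1, dtype=float).copy()
    f2 = np.asarray(f2, dtype=float).copy()
    y = np.asarray(y, dtype=float)
    rounds = 0
    while rounds < max_rounds:
        diff = f1 - f2
        big = np.abs(diff) > eps
        if big.mean() < alpha:
            break  # remaining disagreement is below the tolerance
        # Split the disagreement region by sign and keep the larger part.
        above = big & (diff > 0)
        below = big & (diff < 0)
        region = above if above.sum() >= below.sum() else below
        # Observed mean outcome on the region acts as the "witness".
        target = y[region].mean()
        gap1 = target - f1[region].mean()
        gap2 = target - f2[region].mean()
        # Shift the model that is further from the observed mean, then
        # clip back into [0, 1].
        if abs(gap1) >= abs(gap2):
            f1[region] = np.clip(f1[region] + gap1, 0.0, 1.0)
        else:
            f2[region] = np.clip(f2[region] + gap2, 0.0, 1.0)
        rounds += 1
    return f1, f2, rounds
```

In this sketch each patch lowers the patched model's in-sample Brier score and leaves the other model unchanged, which is why the loop makes progress; the paper's guarantees concern the distributional setting and may differ in constants and details.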

Paper Structure

This paper contains 35 sections, 8 theorems, 38 equations, 14 figures, 3 tables, 2 algorithms.

Key Result

Theorem 3.1

Given any pair of models $f_1, f_2: \mathcal{X} \rightarrow [0,1]$, distribution $\mathcal{D}$ on $\mathcal{X} \times \mathcal{Y}$, and parameters $\alpha, \epsilon>0$, the two models $(f^{T_1}_1, f^{T_2}_2)$ returned by Reconcile when run on $(f_1, f_2, \mathcal{D}, \alpha, \epsilon)$, where $T = T

Figures (14)

  • Figure 1: Diagram of the Reconcile algorithm.
  • Figure 2: Disagreement levels between $f_1$ and $f_2$ before and after Reconcile. Model construction follows the methodology described in the section on building models; model specifics are given in the section on the Reconcile experimental results.
  • Figure 3: Heatmap showing the Mean Squared Error (MSE) across different numbers of random models for each dataset and aggregation method.
  • Figure 4: Heatmap showing MSE values for majority and minority race groups across two models before and after applying Reconcile, for all datasets. In this experiment, random predictions were assigned to the second model for the minority subgroup prior to Reconcile to simulate a scenario where one model underperforms on a specific subgroup.
  • Figure 5: Disagreements between CATE estimates before and after running Reconcile on Twins dataset.
  • ...and 9 more figures

Theorems & Definitions (21)

  • Theorem 3.1: Roth2022-sd
  • Lemma 4.1
  • proof
  • Lemma 4.2
  • proof
  • Definition 4.3
  • Lemma 4.4: Relationship between MA and Reconcile
  • Lemma 5.1
  • proof
  • Definition A.1: $\alpha$-approx. group cond. mean consistency
  • ...and 11 more