Reconciling Model Multiplicity for Downstream Decision Making

Ally Yalei Du, Dung Daniel Ngo, Zhiwei Steven Wu

TL;DR

This paper addresses predictive multiplicity, where two models with similar accuracy induce different best-response (BR) actions for downstream losses. It introduces ReDCal, a two-stage, multi-calibration-based reconciliation framework that (i) aligns individual probability predictions on disagreement regions and (ii) enforces decision calibration so that BR actions reflect the true downstream costs. The authors prove that ReDCal reduces Brier scores, preserves or modestly improves downstream performance, and reduces BR-action disagreements, with finite-sample guarantees extending the theory to empirical data. Empirically, ReDCal outperforms prior reconciliation methods on ImageNet and HAM10000, achieving lower decision losses and closer agreement in BR decisions on these high-dimensional decision-making tasks.

Abstract

We consider the problem of model multiplicity in downstream decision-making, a setting where two predictive models of equivalent accuracy cannot agree on the best-response action for a downstream loss function. We show that even when the two predictive models approximately agree on their individual predictions almost everywhere, it is still possible for their induced best-response actions to differ on a substantial portion of the population. We address this issue by proposing a framework that calibrates the predictive models with regard to both the downstream decision-making problem and the individual probability prediction. Specifically, leveraging tools from multi-calibration, we provide an algorithm that, at each time-step, first reconciles the differences in individual probability prediction, then calibrates the updated models such that they are indistinguishable from the true probability distribution to the decision-maker. We extend our results to the setting where one does not have direct access to the true probability distribution and instead relies on a set of i.i.d. samples as the empirical distribution. Finally, we provide a set of experiments to empirically evaluate our methods: compared to existing work, our proposed algorithm creates a pair of predictive models that both achieve improved downstream decision-making losses and agree on their best-response actions almost everywhere.
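The gap between agreeing on predictions and agreeing on actions can be seen in a small numerical sketch. The example below is illustrative only and is not taken from the paper: the 0-1 loss matrix, the two probability vectors, and the helper name `best_response` are all assumptions, and the best response is taken to be the action minimizing expected loss under the predicted label distribution.

```python
import numpy as np

def best_response(pred, loss):
    """Return the index of the action with the lowest expected loss.

    pred: shape (d,) predicted probability vector over labels.
    loss: shape (num_actions, d); loss[a, y] is the cost of action a under label y.
    """
    return int(np.argmin(loss @ pred))

# Hypothetical binary problem: labels are (healthy, ill),
# actions are (no treatment, treatment), with a 0-1 loss.
loss = np.array([[0.0, 1.0],   # no treatment: costly only if the patient is ill
                 [1.0, 0.0]])  # treatment: costly only if the patient is healthy

f1 = np.array([0.52, 0.48])  # predictor 1: P(ill) = 0.48
f2 = np.array([0.48, 0.52])  # predictor 2: P(ill) = 0.52

a1 = best_response(f1, loss)            # action chosen under predictor 1
a2 = best_response(f2, loss)            # action chosen under predictor 2
gap = float(np.abs(f1 - f2).max())      # largest coordinate-wise prediction gap

print(a1, a2, gap)
```

Under these assumed numbers the two predictions differ by only about 0.04 in each coordinate, yet they fall on opposite sides of the 1/2 threshold, so one predictor recommends no treatment and the other recommends treatment.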

Paper Structure (35 sections, 19 theorems, 81 equations, 6 figures, 3 algorithms)

Key Result

Lemma 2.2

Fix any distribution $\mathcal{D}$ and let $f^*(x) = \mathbb E_{(x,y) \sim \mathcal{D}}[y | x]$ represent the true conditional label encoded by $\mathcal{D}$. Let $f: \mathcal{X} \rightarrow [0,1]^d$ be any other model. Then we have $B(f^*, \mathcal{D}) \leq B(f, \mathcal{D})$.
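A quick numerical check can make the lemma concrete. The sketch below is not from the paper: it assumes the Brier score takes the usual squared-error form $B(f, \mathcal{D}) = \mathbb{E}[(f(x) - y)^2]$ (Definition 2.1 is not reproduced here), restricts to scalar binary labels on a three-point feature space, and uses made-up values for `p_x` and `f_star`. Under those assumptions the excess Brier score of any $f$ over $f^*$ equals $\mathbb{E}[(f(x) - f^*(x))^2]$, which is nonnegative.

```python
import numpy as np

def brier(f, f_star, p_x):
    """Exact Brier score E[(f(x) - y)^2] for binary y on a finite feature space.

    Given x, the label y equals 1 with probability f_star[x] and 0 otherwise.
    """
    per_x = f_star * (f - 1.0) ** 2 + (1.0 - f_star) * f ** 2
    return float(np.sum(p_x * per_x))

# Hypothetical distribution: marginal over three feature values and the
# true conditional probability of y = 1 at each of them.
p_x = np.array([0.5, 0.3, 0.2])
f_star = np.array([0.1, 0.6, 0.9])

rng = np.random.default_rng(0)
b_star = brier(f_star, f_star, p_x)

gaps = []
for _ in range(1000):
    f = rng.uniform(0.0, 1.0, size=3)            # an arbitrary competing model
    excess = brier(f, f_star, p_x) - b_star      # B(f, D) - B(f*, D)
    mse = float(np.sum(p_x * (f - f_star) ** 2)) # E[(f(x) - f*(x))^2]
    gaps.append(abs(excess - mse))
    assert excess >= -1e-12                      # f* is never beaten

max_gap = max(gaps)
print(b_star, max_gap)
```

The random search is only a sanity check, not a proof; the identity it verifies is what the argument for the scalar case rests on.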

Figures (6)

  • Figure 1: An illustrative example of the drawback in a prior work's attempt at addressing model multiplicity. Consider a stylized binary classification problem on a dataset with $8$ units (patients) and a hospital deciding between two actions (treatment vs. no treatment). Treatment is assigned if the predicted probability is above 1/2. Left: The true probability that each patient is labeled 'ill'. Middle: The predicted probability that each patient is ill according to $f_1$ (white) and $f_2$ (blue). While these two predictors have almost the same accuracy, their individual probability predictions for patients $3$ and $6$ differ vastly. Right: After running the Reconcile procedure of roth2023reconciling, the individual probability predictions agree everywhere. However, the best-response action for unit $3$ changed from correct (no treatment) to incorrect (treatment). If the hospital uses the updated $f_1$ to make its treatment recommendation, it would incur more loss than if it had not updated the predictor using Reconcile. This example is formalized in \ref{thm:counterexample-reconcile}.
  • Figure 2: ReDCal decreases Brier score on ImageNet. Compared to Reconcile, our algorithm decreases the Brier score by a smaller amount on the test dataset. Decision-Calibration with ReDCal as post-process achieves the most substantial decrease in the Brier score.
  • Figure 3: In \ref{fig:imagenet_train_loss} and \ref{fig:imagenet_test_loss}, we plot the gap between the optimal loss had we known the true label $y$ and the loss from taking best-response actions induced by the calibrated predictors on the validation set and test set, respectively. In the left two figures, we compare \ref{alg: decision_cali} (orange) with \ref{alg: reconcile-brier} (blue). While the average loss of predictors updated using \ref{alg: reconcile-brier} may increase on the test set, our algorithm quickly converges and produces predictors with lower decision-making loss. In the right two figures, we compare \ref{alg: decision_cali} (green) to \ref{alg: decision_cali} with an additional run of \ref{alg: reconcile} (red) as post-process. We observe that running our algorithm as post-process can further decrease the loss compared to running \ref{alg: decision_cali} on its own. Results are averaged over $10$ runs and the shaded region indicates $\pm 1$ standard error.
  • Figure 4: ReDCal decreases decision loss on ImageNet. The takeaways are similar to those of \ref{fig:imagenet_test_loss}. As the number of classes in the multi-class classification problem grows from $10$ to $1000$, ReDCal still outperforms Reconcile in decreasing decision loss on the test dataset. With $1000$ classes, ReDCal converges more slowly than Reconcile. In addition, ReDCal can further decrease the decision loss when it is used as a post-process after Decision Calibration terminates.
  • Figure 5: Brier score of the updated predictors using \ref{alg: reconcile} (orange) and two benchmark algorithms: \ref{alg: reconcile-brier} (dashed-blue) and \ref{alg: decision_cali} (dashed-green). Our algorithm reduces the Brier score by a smaller amount compared to \ref{alg: reconcile}. Results are averaged over $10$ runs and the shaded region indicates $\pm 1$ standard error.
  • ...and 1 more figure

Theorems & Definitions (41)

  • Definition 2.1: Brier Score
  • Lemma 2.2
  • Definition 2.3: Best-response policy
  • Definition 2.4: Best-response Events
  • Definition 2.5: $\beta$-approximate decision calibration
  • Lemma 2.6: zhao2021calibrating
  • Theorem 2.7
  • proof
  • Theorem 2.8
  • proof
  • ...and 31 more