Reconciling Model Multiplicity for Downstream Decision Making
Ally Yalei Du, Dung Daniel Ngo, Zhiwei Steven Wu
TL;DR
This paper addresses predictive multiplicity, where two models with similar accuracy induce different best-response (BR) actions for downstream losses. It introduces ReDCal, a two-stage, multi-calibration-based reconciliation framework that (i) aligns individual probability predictions on disagreement regions and (ii) enforces decision calibration so that BR actions reflect the true downstream costs. The authors prove that ReDCal reduces Brier scores, preserves or modestly increases downstream performance, and minimizes BR-action disagreements, with finite-sample guarantees extending the theory to empirical data. Empirically, ReDCal outperforms prior reconciliation methods on ImageNet and HAM10000, achieving lower decision losses and greater agreement in BR actions, which demonstrates practical impact for high-dimensional decision-making tasks.
Abstract
We consider the problem of model multiplicity in downstream decision-making, a setting where two predictive models of equivalent accuracy cannot agree on the best-response action for a downstream loss function. We show that even when the two predictive models approximately agree on their individual predictions almost everywhere, it is still possible for their induced best-response actions to differ on a substantial portion of the population. We address this issue by proposing a framework that calibrates the predictive models with regard to both the downstream decision-making problem and the individual probability prediction. Specifically, leveraging tools from multi-calibration, we provide an algorithm that, at each time-step, first reconciles the differences in individual probability prediction, then calibrates the updated models such that they are indistinguishable from the true probability distribution to the decision-maker. We extend our results to the setting where one does not have direct access to the true probability distribution and instead relies on a set of i.i.d. data as the empirical distribution. Finally, we provide a set of experiments to empirically evaluate our methods: compared to existing work, our proposed algorithm creates a pair of predictive models that both achieve improved downstream decision-making losses and agree on their best-response actions almost everywhere.
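
To make the central observation concrete, the following is a minimal sketch of how two models whose probability predictions differ only slightly can still induce different best-response actions. It is an illustration under stated assumptions, not the paper's algorithm: the binary outcome, the two-action loss matrix `LOSS`, the helper names `best_response` and `disagreement_rate`, and the toy predictions `p1` and `p2` are all hypothetical.

```python
import numpy as np

# Hypothetical loss matrix (an assumption for illustration):
# LOSS[a, y] is the cost of taking action a when the true outcome is y.
LOSS = np.array([[0.0, 1.0],   # action 0: costly only if y = 1
                 [1.0, 0.0]])  # action 1: costly only if y = 0


def best_response(p, loss=LOSS):
    """Return, for each predicted P(y = 1), the action with minimal expected loss."""
    p = np.asarray(p, dtype=float)
    # expected[i, a] = (1 - p[i]) * loss[a, 0] + p[i] * loss[a, 1]
    expected = np.outer(1.0 - p, loss[:, 0]) + np.outer(p, loss[:, 1])
    return expected.argmin(axis=1)


def disagreement_rate(p1, p2, loss=LOSS):
    """Fraction of individuals on which the two models' best responses differ."""
    return float(np.mean(best_response(p1, loss) != best_response(p2, loss)))


# Toy predictions from two hypothetical models on four individuals.
p1 = np.array([0.48, 0.49, 0.10, 0.90])
p2 = np.array([0.52, 0.51, 0.12, 0.88])

max_gap = float(np.max(np.abs(p1 - p2)))  # largest per-individual prediction gap
rate = disagreement_rate(p1, p2)          # share of best-response disagreements

print(f"max prediction gap: {max_gap:.2f}")
print(f"best-response disagreement rate: {rate:.2f}")
```

In this toy setup the predictions never differ by more than about 0.04, yet the best responses disagree on the two individuals whose predictions straddle the decision threshold of 0.5 implied by the assumed loss matrix. Closeness in prediction alone therefore does not guarantee agreement in action, which is the gap that calibrating with respect to the downstream decision problem is meant to close.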
