Decision-Focused Evaluation of Worst-Case Distribution Shift

Kevin Ren; Yewon Byun; Bryan Wilder

TL;DR

This work addresses how distribution shift affects downstream resource-allocation decisions by proposing a two-level hierarchical generative model that captures shifts across and within optimization instances. It reformulates the worst-case shift problem as a DR-submodular optimization and solves it with a non-monotone Frank-Wolfe algorithm enhanced with momentum, enabling scalable approximation. Empirically, worst-case distributions identified under a given metric often diverge from those identified under other metrics, highlighting that decision-focused robustness must align with the allocation task rather than per-instance accuracy alone. The approach demonstrates substantial efficiency and robustness improvements over standard polynomial solvers and reveals important practical implications for deploying ML in high-stakes, allocation-based settings.
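As a rough illustration of the optimization step described above, the following is a minimal sketch of a non-monotone Frank-Wolfe loop with a momentum-averaged gradient. It is not the paper's implementation. The budget-and-box feasible set, the toy quadratic objective, the momentum rule, and every name here (`lmo_budget_box`, `nonmonotone_frank_wolfe`, `rho`, `h`, `H`) are assumptions chosen for illustration. The toy objective is DR-submodular because all entries of `H` are non-positive.

```python
import numpy as np

def lmo_budget_box(grad, upper, budget):
    """Maximize <v, grad> over {0 <= v <= upper, sum(v) <= budget}.

    With unit costs this is a fractional knapsack, so a greedy fill in
    order of decreasing gradient is optimal.
    """
    v = np.zeros_like(grad)
    remaining = budget
    for i in np.argsort(-grad):
        if grad[i] <= 0 or remaining <= 0:
            break
        v[i] = min(upper[i], remaining)
        remaining -= v[i]
    return v

def nonmonotone_frank_wolfe(grad_fn, n, budget, steps=100, rho=0.5):
    """Frank-Wolfe variant for non-monotone DR-submodular maximization.

    The direction v is restricted to v <= 1 - x (the shrunken constraint),
    and the gradient is smoothed with momentum (an assumed update rule).
    """
    x = np.zeros(n)
    d = np.zeros(n)
    gamma = 1.0 / steps
    for _ in range(steps):
        d = (1 - rho) * d + rho * grad_fn(x)    # momentum-averaged gradient
        v = lmo_budget_box(d, 1.0 - x, budget)  # linear maximization step
        x = x + gamma * v
    return x

# Toy non-monotone DR-submodular objective (hypothetical, for illustration).
h = np.array([1.0, 0.8, 0.6, 0.4, 0.2])
H = -0.5 * np.ones((5, 5))

def objective(x):
    return h @ x + 0.5 * x @ H @ x

def grad_fn(x):
    return h + H @ x

budget = 2.0
x = nonmonotone_frank_wolfe(grad_fn, n=5, budget=budget)
```

Because the step size is `1 / steps` and every direction is feasible, the final iterate is an average of feasible points and stays inside the budget and the unit box.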

Abstract

Distribution shift is a key challenge for predictive models in practice, creating the need to identify potentially harmful shifts in advance of deployment. Existing work typically defines these worst-case shifts as ones that most degrade the individual-level accuracy of the model. However, when models are used to make a downstream population-level decision like the allocation of a scarce resource, individual-level accuracy may be a poor proxy for performance on the task at hand. We introduce a novel framework that employs a hierarchical model structure to identify worst-case distribution shifts in predictive resource allocation settings by capturing shifts both within and across instances of the decision problem. This task is more difficult than in standard distribution shift settings due to combinatorial interactions, where decisions depend on the joint presence of individuals in the allocation task. We show that the problem can be reformulated as a submodular optimization problem, enabling efficient approximations of worst-case loss. Applying our framework to real data, we find empirical evidence that worst-case shifts identified by one metric often significantly diverge from worst-case distributions identified by other metrics.
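To make the gap between individual-level accuracy and allocation performance concrete, here is a small hypothetical example. The top-k allocation rule, the regret definition, and all numbers are assumptions for illustration and are not taken from the paper. One set of predictions has lower mean-squared error yet produces a worse allocation.

```python
import numpy as np

def mse(pred, y):
    """Individual-level error: mean squared difference."""
    return float(np.mean((pred - y) ** 2))

def allocation_regret(pred, y, k):
    """Utility lost by giving k units to the top-k predicted individuals
    instead of the top-k individuals by true outcome."""
    chosen = np.argsort(-pred)[:k]
    best = np.argsort(-y)[:k]
    return float(y[best].sum() - y[chosen].sum())

y = np.array([1.0, 0.9, 0.1, 0.0])          # true outcomes
pred_a = np.array([0.9, 0.5, 0.55, 0.0])    # small errors, wrong ranking
pred_b = np.array([2.0, 1.9, 0.1, 0.0])     # large errors, correct ranking

mse_a, mse_b = mse(pred_a, y), mse(pred_b, y)
regret_a = allocation_regret(pred_a, y, k=2)
regret_b = allocation_regret(pred_b, y, k=2)
```

In this toy case `pred_a` looks better under MSE but misallocates one unit, while `pred_b` has larger errors and zero regret, which is the kind of divergence between metrics the abstract describes.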

Paper Structure (17 sections, 3 theorems, 24 equations, 5 figures, 1 algorithm)

Introduction
Problem Setup
Methods
Experiments and Results
Loss functions
Experimental Setup
Results
Main Results
Analysis of Optimization Instances
Efficiency of the Frank-Wolfe Algorithm
Discussion
Acknowledgements
Related Work
Full Proofs of DR-submodularity
Additional Methodological Details
...and 2 more sections

Key Result

Theorem 3.1

Suppose we have a solution $W$ to Equation eq:eq2 with value at least $\alpha \cdot OPT'_W - \epsilon$ for some $\alpha \in \mathbb{R}, \epsilon \in \mathbb{R}$, where $OPT'_W$ is the optimal value. $W$ corresponds to a $Q$ with value at least $\alpha \cdot OPT_Q - \epsilon$ where $OPT_Q$ is the op

Figures (5)

Figure 1: Diagonal-normalized aggregated heat maps over states for models trained with CE loss (in the regression case, MSE loss) (top row) and SPO loss (bottom row). From left to right in each row, results are displayed by task for (a,d) employment classification, (b,e) income classification, and (c,f) income regression. Within each heat map, rows denote the metric the worst-case distribution maximizes, and columns denote the metrics the worst-case distribution was evaluated on. Note that each column is divided by the diagonal entry in that column, resulting in a main diagonal of all 1.0. Since CE loss is always negative, each entry in columns corresponding to CE loss is equal to the diagonal entry in that column divided by the original loss in that cell. The strong main diagonals here accentuate our observation that worst-case distributions w.r.t. a given metric tend to 'overfit' on that metric. Note that cross-entropy and accuracy are used only in binary classification tasks, and mean-squared error and utility-based loss are used only in the income regression task.
Figure 2: Plots of individuals in an optimization instance in the employment prediction task, for worst-case distributions w.r.t. (a) CE loss and (b) the fairness-based loss. The underlying predictive model is trained with CE loss. For each worst-case distribution w.r.t. the metric of interest, we display, for all individuals, their model predictions, assigned weights, and education level, with subplots for each label. The color bar denotes the weight in the worst-case distribution, and differently shaped points represent individuals of different races (circle for white, square for non-white).
Figure 3: Plots of individuals in an optimization instance in the income regression task, for worst-case distributions w.r.t. (a) MSE and (b) the utility-based metric. The underlying predictive model is trained with mean-squared error loss. For each worst-case distribution, we display, for all individuals, their model predictions, true income, and assigned weights. In each panel the identity line ($\text{True Income} = \text{Model Prediction}$) is marked with a dotted line.
Figure 4: Aggregate results of efficiency experiment. The converged expected values of loss are plotted over all metrics and all datasets. Here $n_j = 8$, and we calculate reference worst-case expected values of each metric using Pyomo with the IPOPT solver. All values are normalized w.r.t. the value of the Pyomo/IPOPT solution. The bolded flat line at $y=1$ represents the Pyomo/IPOPT solutions. The colored lines each represent, for a given predictive model training method, prediction task, and metric, how closely our method asymptotically reaches the solution quality of the Pyomo/IPOPT solutions. We find that the majority of our curves converged asymptotically to over 80% of the Pyomo/IPOPT solution.
Figure 5: Diagonal-normalized aggregated heat maps with 95% confidence intervals over states for models trained with CE loss (in the regression case, mean-squared error) (top row) and SPO loss (bottom row). From left to right in each row, results are displayed by task for (a,d) employment classification, (b,e) income classification, and (c,f) income regression. Within each heat map, rows denote the metric the worst-case distribution maximizes, and columns denote the metrics the worst-case distribution was evaluated on. Note that each column is divided by the diagonal entry in that column, resulting in a main diagonal of all 1.0. Furthermore, because CE loss is always negative, each entry in columns corresponding to CE loss is equal to the diagonal entry in that column divided by the original loss in that cell.
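The diagonal normalization described in the captions of Figures 1 and 5 can be sketched as follows. This is an illustrative reading of the captions, not the authors' code. The function name `diagonal_normalize`, the parameter `inverted_cols`, and the example matrix are hypothetical.

```python
import numpy as np

def diagonal_normalize(M, inverted_cols=()):
    """Divide each column j of M by its diagonal entry M[j, j].

    For columns listed in inverted_cols (e.g. a column holding negative
    loss values), use diagonal / entry instead, as the captions describe
    for the CE columns.
    """
    M = np.asarray(M, dtype=float)
    diag = np.diag(M)
    out = M / diag  # broadcasting divides column j by diag[j]
    for j in inverted_cols:
        out[:, j] = diag[j] / M[:, j]
    return out

# Rows: metric the worst-case distribution maximizes.
# Columns: metric the distribution is evaluated on.
raw = np.array([[-1.0, 1.0],
                [-2.0, 4.0]])
normalized = diagonal_normalize(raw, inverted_cols=(0,))
```

After normalization the main diagonal is all 1.0, so an off-diagonal entry below 1.0 indicates that a worst-case distribution found for one metric is less harmful when evaluated on another.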

Theorems & Definitions (8)

Theorem 3.1
Definition 3.2
Theorem 3.3
Definition B.1
Lemma B.2
proof
proof
proof
