Winner's Curse Drives False Promises in Data-Driven Decisions: A Case Study in Refugee Matching

Hamsa Bastani, Osbert Bastani, Bryce McLaughlin

TL;DR

The paper demonstrates that model-based policy evaluation in data-driven decision-making can produce large, spurious gains due to the winner's curse, even when the common justifications (model accuracy, randomization, well-specified models, and sample splitting) all hold. Through two strands, theoretical analyses of misspecification and regularization and a refugee-matching case study in a controlled synthetic environment, the authors show that model-based estimates can report overly optimistic benefits (around 60% in their refugee scenario) even though the true effect is zero. The work also surveys the recent literature and finds widespread reliance on model-based evaluation without valid uncertainty accounting, underscoring the need for alternative methods that guarantee validity, lower variance, or account for statistical significance. Practically, the findings urge caution in deploying data-driven assignment policies and motivate methods that integrate real-world outcomes or rely on model-free evaluation with variance control, so that reported gains reflect true policy improvements.

Abstract

A major challenge in data-driven decision-making is accurate policy evaluation, i.e., guaranteeing that a learned decision-making policy achieves the promised benefits. A popular strategy is model-based policy evaluation, which estimates a model from data to infer counterfactual outcomes. This strategy is known to produce unwarrantedly optimistic estimates of the true benefit due to the winner's curse. We searched the recent literature on data-driven decision-making, identifying a sample of 55 papers published in Management Science in the past decade; all but two relied on this flawed methodology. Several common justifications are provided: (1) the estimated models are accurate, stable, and well-calibrated, (2) the historical data uses random treatment assignment, (3) the model family is well-specified, and (4) the evaluation methodology uses sample splitting. Unfortunately, we show that no combination of these justifications avoids the winner's curse. First, we provide a theoretical analysis demonstrating that the winner's curse can cause large, spurious reported benefits even when all these justifications hold. Second, we perform a simulation study based on the recent and consequential data-driven refugee matching problem. We construct a synthetic refugee matching environment (calibrated to closely match the real setting) but designed so that no assignment policy can improve expected employment compared to random assignment. Model-based methods report large, stable gains of around 60% even when the true effect is zero; these gains are on par with improvements of 22-75% reported in the literature. Our results provide strong evidence against model-based evaluation.
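
The mechanism described in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' refugee-matching environment: it assumes a hypothetical setting with a handful of interchangeable "locations" that all share the same true employment rate, so no policy can beat random assignment. The function name `simulate_winners_curse` and all parameter values are illustrative assumptions. The policy picks the location with the highest estimated rate, and the model-based evaluation reports that same estimate as the policy's value.

```python
import numpy as np

def simulate_winners_curse(n_arms=20, n_per_arm=50, base_rate=0.3,
                           n_trials=2000, seed=0):
    """Toy winner's-curse illustration (hypothetical setup, not the paper's).

    Every arm has the same true success rate `base_rate`, so the true gain
    of any selection policy over random assignment is exactly zero.
    """
    rng = np.random.default_rng(seed)
    # Number of successes observed for each arm in each simulated dataset.
    successes = rng.binomial(n_per_arm, base_rate, size=(n_trials, n_arms))
    est_rates = successes / n_per_arm  # per-arm estimated success rates
    # Model-based evaluation: select the arm with the best estimate and
    # report that estimate as the value of the learned policy.
    reported_value = est_rates.max(axis=1)
    # The true value of whichever arm is chosen is still base_rate.
    true_value = base_rate
    reported_gain = reported_value.mean() / base_rate - 1.0
    true_gain = true_value / base_rate - 1.0
    return reported_gain, true_gain

if __name__ == "__main__":
    reported_gain, true_gain = simulate_winners_curse()
    print(f"average reported gain: {reported_gain:.1%}")
    print(f"true gain:             {true_gain:.1%}")
```

In this sketch the reported gain is clearly positive while the true gain is zero by construction, because taking the maximum over noisy estimates selects for upward noise. Each per-arm estimate is unbiased on its own; the bias comes only from evaluating the policy with the same estimates that were used to choose it.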

Paper Structure

This paper contains 34 sections, 13 theorems, 63 equations, 7 figures, 3 tables.

Key Result

Lemma 1

The ordinary least squares (OLS) estimate of $\beta\in\mathbb{R}$ is

Figures (7)

  • Figure 1: Illustration of stylized example. The $x$-axis is the treatment $t$, the $y$-axis is the performance outcome $y$, the solid black line is $f^*(t)$, the dashed black line is $\hat{f}(t)$, the green star denotes $f^*(t^*)$ at $t^*=\operatorname*{\arg\max}_{t\in[0,1]}f^*(t)$, the red circle denotes $\hat{f}(\hat{t})$ at $\hat{t}=\operatorname*{\arg\max}_{t\in[0,1]}\hat{f}(t)$, and the red line denotes the optimistic bias $\hat{f}(\hat{t})-f^*(\hat{t})$ due to the winner's curse.
  • Figure 2: Histograms of 250 evaluations, grouped by the prediction model used to optimize the policy and compared with IPW estimates. All prediction models exhibit bias due to the winner's curse, while IPW estimation suffers from excessive variance. All evaluations are reported as percent change in employment rate relative to the observed employment rate in the testing dataset.
  • Figure 3: Histograms of 250 evaluations by prediction models built on bootstrapped samples of the training dataset, compared with IPW estimates and with direct evaluation by the prediction model that optimized the assignments. The stability of the LASSO estimator leads the bootstrapped samples to exhibit a larger winner's curse than the original prediction model, while GBM exhibits a reduced, but still quite large, winner's curse. Honest Random Forests were able to eliminate most of their winner's curse bias in this simulation, although in alternate setups we have observed sizable inaccuracies (both positive and negative), indicating that the bootstrapping approach cannot generate trustworthy confidence intervals. All evaluations are reported as percent change in employment rate relative to the observed employment rate in the testing dataset.
  • Figure EC.1: Probability of each location being assigned to each refugee sample, sorted in ascending order. This distribution follows the empirical distribution reported in Bansak et al. (2018).
  • Figure EC.2: The distribution of employment rates across locations has been tuned to resemble that in Bansak et al. (2018), to maximize similarity between the simulation environment and real-world data.
  • ...and 2 more figures
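
The sign of the optimistic bias marked in Figure 1 follows from a short argument. As a sketch, assume (an assumption added here, not stated in the caption) that $\hat{f}$ is pointwise unbiased, i.e. $\mathbb{E}[\hat{f}(t)]=f^*(t)$ for every fixed $t\in[0,1]$. Since $\hat{t}$ maximizes $\hat{f}$ and $t^*$ maximizes $f^*$,

$$
\mathbb{E}\big[\hat{f}(\hat{t})\big] \;\ge\; \mathbb{E}\big[\hat{f}(t^*)\big] \;=\; f^*(t^*) \;\ge\; \mathbb{E}\big[f^*(\hat{t})\big]
\quad\Longrightarrow\quad
\mathbb{E}\big[\hat{f}(\hat{t})-f^*(\hat{t})\big] \;\ge\; 0 .
$$

Under that assumption the reported value $\hat{f}(\hat{t})$ overstates the true value $f^*(\hat{t})$ in expectation, even though every individual prediction is unbiased.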

Theorems & Definitions (25)

  • Lemma 1
  • Proposition 1
  • Proposition 2
  • Definition 1
  • Proposition 3
  • Definition EC.1
  • Definition EC.2
  • Lemma EC.1
  • proof
  • Lemma EC.2
  • ...and 15 more