Leaving the Nest: Going Beyond Local Loss Functions for Predict-Then-Optimize

Sanket Shah; Andrew Perrault; Bryan Wilder; Milind Tambe

TL;DR

This paper tackles the challenge of making predictive models more decision-focused within Predict-then-Optimize (PtO) by introducing Efficient Global Losses (EGLs). EGLs combine feature-based parameterization (FBP), which maps prediction features to the parameters of a convex loss, with model-based sampling (MBS), which generates diverse, realistic training samples; together these address Fisher Consistency and sample efficiency. The authors prove that traditional weighted losses can fail Fisher Consistency in PtO, while a WeightedMSE variant under FBP recovers it for linear objectives. Empirically, EGLs achieve state-of-the-art results across four PtO domains with an order of magnitude fewer training samples and remain robust when the localness assumption is violated, highlighting the practical viability of decision-focused learning. The work also analyzes computational trade-offs, showing significant speedups driven by improved sample efficiency and targeted sampling strategies, making decision-focused training more accessible in practice.

Abstract

Predict-then-Optimize is a framework for using machine learning to perform decision-making under uncertainty. The central research question it asks is, "How can the structure of a decision-making task be used to tailor ML models for that specific task?" To this end, recent work has proposed learning task-specific loss functions that capture this underlying structure. However, current approaches make restrictive assumptions about the form of these losses and their impact on ML model behavior. These assumptions lead to approaches with high computational cost and, when they are violated in practice, to poor performance. In this paper, we propose solutions to these issues, avoiding the aforementioned assumptions and utilizing the ML model's features to increase the sample efficiency of learning loss functions. We empirically show that our method achieves state-of-the-art results in four domains from the literature, often requiring an order of magnitude fewer samples than comparable methods from past work. Moreover, our approach outperforms the best existing method by nearly 200% when the localness assumption is broken.

Paper Structure (22 sections, 3 theorems, 10 equations, 3 figures, 7 tables)

Introduction
Background
Predict-then-Optimize
Task-Specific Loss Functions
Related Work
Part One: Feature-based Parameterization
Fisher Consistency
Part Two: Model-based Sampling
Localness of Predictions
Experiments
Overall Results
Computational Complexity Experiments
Ablation Study
Proof of \ref{lemma:weights}
Numerical Details for the Counter-Example in \ref{sec:example}
...and 7 more sections

Key Result

Theorem 4.2

Weighting-the-MSE losses are not Fisher Consistent for Predict-then-Optimize problems in which the optimization function $\bm{z}^*$ has a linear objective.
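
To make the object of the theorem concrete, the following is a minimal sketch of a weighted-MSE loss, together with one possible feature-based way of producing its weights. It is an illustration under stated assumptions, not the authors' implementation: the softplus link, the linear map `theta`, and the function names are all hypothetical, chosen only so that the weights stay positive and the loss stays convex in the prediction.

```python
import numpy as np

def softplus(a):
    # Numerically stable log(1 + exp(a)); used here to keep weights positive.
    return np.logaddexp(0.0, a)

def weighted_mse(y_pred, y_true, weights):
    # sum_n w_n * (y_pred_n - y_true_n)^2, convex in y_pred when all w_n > 0.
    return float(np.sum(weights * (y_pred - y_true) ** 2))

def feature_based_weights(features, theta):
    # Hypothetical feature-based parameterization (an assumption of this
    # sketch): each item's weight is a function of that item's features.
    return softplus(features @ theta)

rng = np.random.default_rng(0)
features = rng.uniform(-1.0, 1.0, size=(50, 3))  # illustrative per-item features
theta = rng.normal(size=3)                       # stand-in for learned loss parameters
y_true = rng.normal(size=50)
y_pred = y_true + 0.1 * rng.normal(size=50)

weights = feature_based_weights(features, theta)
loss = weighted_mse(y_pred, y_true, weights)
print(f"weighted MSE: {loss:.4f}")
```

Because the weights are computed from features rather than stored per instance, the same `theta` is shared across all training instances, which is the sense in which such a loss is "global" rather than local.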

Figures (3)

Figure 2: Cubic Top-K Domain. The underlying mapping from $x \to y$ is given by the green dashed line. The set $\bm{y}$ consists of $N=50$ points where $x_n \sim U[-1, 1]$, and the goal is to predict the point with the largest $y$. The linear model that minimizes the MSE loss is given in blue.
Figure 3: Visualizing Sampling Strategies for the Cubic Top-K Domain. The points in green represent the true labels for some instance $\bm{y}$ with the dashed curve representing the underlying mapping $x \to y$. The points in orange and red each represent a set of sampled predictions $\bm{\Tilde{y}}_{\text{orange}}$ and $\bm{\Tilde{y}}_{\text{red}}$ with the larger point denoting the sampled prediction with the maximum value.
Figure 4: The amount of time taken to create the dataset used to train the loss functions vs. the number of samples per instance $\bm{y}$ in that dataset. To make the results comparable across different experimental domains, we divide the actual generation time by the time taken to generate a dataset containing 32 samples for each domain. We see that for the Web Advertising and Portfolio Optimization domains, the cost scales roughly linearly with the number of samples. For the Cubic Top-K domain, the decision-making problem (top-k) is trivial and the computation is determined by other overheads, leading to a near-constant generation time.
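
The setup described in the Figure 2 and Figure 3 captions can be sketched in a few lines. This is a hedged illustration only. The cubic `10x^3 - 6.5x` is an assumed stand-in, since the captions do not give the exact mapping; the two samplers are simplified guesses at what "sampled predictions" might look like (noise around the true labels versus outputs of a partially trained model), and the captions do not say which colour corresponds to which strategy.

```python
import numpy as np

def true_mapping(x):
    # Assumed cubic for illustration; not taken from the captions.
    return 10.0 * x**3 - 6.5 * x

rng = np.random.default_rng(0)
num_instances, N = 200, 50
X = rng.uniform(-1.0, 1.0, size=(num_instances, N))  # x_n ~ U[-1, 1]
Y = true_mapping(X)

# Linear model that minimizes MSE over all pooled points.
slope, intercept = np.polyfit(X.ravel(), Y.ravel(), deg=1)
Y_hat = slope * X + intercept

# Top-1 decision: pick the point with the largest predicted value,
# then measure how much true value is lost relative to the best point.
rows = np.arange(num_instances)
chosen = np.argmax(Y_hat, axis=1)
regret = Y.max(axis=1) - Y[rows, chosen]
print(f"slope={slope:.3f}, mean regret={regret.mean():.3f}")

def sample_local(y, sigma, rng):
    # Hypothetical "local" sampler: Gaussian noise around the true labels.
    return y + sigma * rng.normal(size=y.shape)

def sample_model_based(x, init_params, fit_params, t):
    # Hypothetical model-based sampler: predictions of a linear model whose
    # parameters lie a fraction t of the way from a random initialization
    # to the MSE fit, mimicking an intermediate training checkpoint.
    a = (1.0 - t) * init_params[0] + t * fit_params[0]
    b = (1.0 - t) * init_params[1] + t * fit_params[1]
    return a * x + b

init_params = rng.normal(size=2)
local_sample = sample_local(Y[0], sigma=0.5, rng=rng)
model_sample = sample_model_based(X[0], init_params, (slope, intercept), t=0.5)
```

Under the assumed cubic, the MSE-optimal line slopes downward, so the top-1 decision it induces tends to pick points at the far left of the interval, which is why a low MSE can coexist with poor decision quality in this domain.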

Theorems & Definitions (7)

Definition 4.1: Fisher Consistency
Theorem 4.2
proof
Lemma 4.3
Theorem 4.4
proof
proof
