Decision-Focused Learning with Directional Gradients

Michael Huang; Vishal Gupta

TL;DR

This work proposes a novel family of decision-aware surrogate losses, called Perturbation Gradient losses, for the predict-then-optimize framework, and provides numerical evidence confirming that these losses substantively outperform existing proposals when the underlying model is misspecified.

Abstract

We propose a novel family of decision-aware surrogate losses, called Perturbation Gradient (PG) losses, for the predict-then-optimize framework. The key idea is to connect the expected downstream decision loss with the directional derivative of a particular plug-in objective, and then approximate this derivative using zeroth-order gradient techniques. Unlike the original decision loss, which is typically piecewise constant and discontinuous, our new PG losses are Lipschitz continuous, difference-of-concave functions that can be optimized using off-the-shelf gradient-based methods. Most importantly, unlike existing surrogate losses, the approximation error of our PG losses vanishes as the number of samples grows. Hence, optimizing our surrogate loss yields a best-in-class policy asymptotically, even in misspecified settings. This is the first such result in misspecified settings, and we provide numerical evidence confirming that our PG losses substantively outperform existing proposals when the underlying model is misspecified.
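The key idea can be illustrated with a minimal sketch. It assumes a linear objective, a finite feasible set stored as the rows of a matrix `Z`, and a plug-in objective $V(t) = \min_{z} t^\top z$; none of these specifics are stated in the abstract. Under those assumptions the decision loss is the directional derivative of $V$ at the prediction $t$ in the direction of the realized cost $y$, and finite differences with step `h` give backward and central approximations. The function names and the step size `h` are hypothetical.

```python
import numpy as np

def plug_in_value(t, Z):
    """V(t) = min over rows z of Z of t^T z (assumed plug-in objective)."""
    return float(np.min(Z @ t))

def decision_loss(t, y, Z):
    """Cost under the true y of the decision that is optimal for the prediction t."""
    z_star = Z[np.argmin(Z @ t)]  # ties are broken by the first minimizer
    return float(y @ z_star)

def pg_backward(t, y, Z, h):
    """Backward finite difference of V at t in direction y (assumed form)."""
    return (plug_in_value(t, Z) - plug_in_value(t - h * y, Z)) / h

def pg_central(t, y, Z, h):
    """Central finite difference of V at t in direction y (assumed form)."""
    return (plug_in_value(t + h * y, Z) - plug_in_value(t - h * y, Z)) / (2.0 * h)

if __name__ == "__main__":
    Z = np.eye(3)                   # toy problem: pick exactly one of three items
    t = np.array([1.0, 2.0, 3.0])   # predicted costs
    y = np.array([1.0, 5.0, 0.0])   # realized costs
    print("decision loss:", decision_loss(t, y, Z))
    for h in (0.01, 1.0):
        print("h =", h,
              "backward:", pg_backward(t, y, Z, h),
              "central:", pg_central(t, y, Z, h))
```

In this toy problem a small `h` reproduces the decision loss, while a larger `h` trades accuracy for a smoother surrogate. The backward difference stays at or above the decision loss for every `h > 0` because $V$ is concave in this setup.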

Paper Structure (27 sections, 11 theorems, 56 equations, 9 figures)


Introduction
Contributions
Related Work
Notation and Preliminaries
A New Family of Surrogate Losses
Properties of PG Losses
Performance Guarantees
Expected Approximation Error
Uniform Error Bounds
Excess Regret Bounds
Numerical Experiments
Simple Misspecification Experiment
Shortest Path Experiments
Portfolio Experiment
Conclusion
...and 12 more sections

Key Result

Lemma 2.1

Suppose the Boundedness assumption holds. For any $t, t^\prime \in \mathbb{R}^d$ and $y \in \mathcal{Y}$, the PG losses are [...]. Finally, the backward difference upper-bounds the true loss, i.e., $\ell(t, y) \le \hat{\ell}^b(t, y)$.
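The final inequality of the lemma can be checked with a short derivation. This is a sketch under assumptions that this excerpt does not state: a linear objective, a plug-in objective $V(t) = \min_{z \in \mathcal{Z}} t^\top z$ with minimizer $z^*(t)$, a decision loss $\ell(t, y) = y^\top z^*(t)$, and a backward difference of the form $\hat{\ell}^b(t, y) = \bigl(V(t) - V(t - h y)\bigr)/h$ for a step size $h > 0$.

```latex
% z^*(t) is feasible (though not necessarily optimal) for the cost vector t - h y:
\begin{align*}
V(t - h y) &\le (t - h y)^\top z^*(t) = V(t) - h\, y^\top z^*(t) \\
\Longrightarrow \quad
\ell(t, y) = y^\top z^*(t) &\le \frac{V(t) - V(t - h y)}{h} = \hat{\ell}^b(t, y).
\end{align*}
```

Under these assumptions the bound holds for every $h > 0$, since it uses only the feasibility of $z^*(t)$.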

Figures (9)

Figure 1: (Convergence under Misspecification). Excess regret normalized by the optimal policy's performance under the misspecified setting of the Simple Misspecification Experiment ($\alpha = 1$, $m=0$). PGB is our proposed loss. ETO is a decision-blind approach that minimizes MSE. Other benchmarks include: DBB [poganvcic2019differentiation], FYL [berthet2020learning], and SPO+ [elmachtoub2022smart]. Under misspecification, only the PG losses have vanishing excess regret. Error bars are $95\%$ confidence intervals on the mean over 100 trials.
Figure 2: (Comparing Surrogates under Misspecification). See the Simple Misspecification Experiment for setup ($\alpha = 1$, $m=0$). Benchmarks are the decision loss (DL) $\ell$, our PGB and PGC losses, the Fenchel-Young Loss (FYL) [berthet2020learning], SPO+ [elmachtoub2022smart], and the learning-to-rank list loss [DBLP:journals/corr/abs-2112-03609]. Left panel: ($n=200$) Only our PG losses closely track the DL. Right panels: As $n$ increases, the DL and PG losses both become smoother.
Figure 3: (SPO+ Comparisons) The left figure plots the excess regret normalized by the optimal policy's performance as we vary $m$ for $n=80$ and $\alpha = 1$. The right figure plots the same value as we vary $\alpha$ for $n=200$. When $\alpha = 0$ the noise is centrally symmetric and when $\alpha = 1$ the noise is the most asymmetric. Error bars are $95\%$ confidence intervals on the mean over 100 trials.
Figure 4: (Harder Shortest Path) a) One of the two planted paths will be optimal depending on the value of $X_6$. All other arcs are strictly worse. b) Normalized excess regret as we vary the number of training samples. Error bars are $95\%$ confidence intervals on the mean over 100 trials.
Figure 5: (Portfolio Optimization) We plot the excess regret normalized by optimal policy's performance as we vary the number of training samples. Error bars are $95\%$ confidence intervals on the mean over 100 trials.
...and 4 more figures

Theorems & Definitions (22)

Lemma 2.1: Basic Properties
Lemma 2.2: Informative Gradients
Lemma 3.2: Expected Approximation Error
Corollary 3.3: Pointwise Approximation Error
Theorem 3.4: Uniform Error Bound for General $\mathcal{Z}$
Definition 3.5: VC-Linear-Subgraph Dimension
Theorem 3.7: Uniform Error Bound for Polyhedral $\mathcal{Z}$
Theorem 3.8: Excess Regret Bounds
...and 12 more
