Contextual Linear Optimization with Partial Feedback

Yichun Hu, Nathan Kallus, Xiaojie Mao, Yanchen Wu

TL;DR

This work tackles contextual linear optimization under partial feedback, focusing on bandit and semi-bandit settings where only partial cost information is observed. It introduces a unified Induced Empirical Risk Minimization (IERM) framework that uses score functions to evaluate expected policy cost from partial data and employs cross-fitting to estimate nuisance quantities. The authors establish fast-rate regret bounds under model misspecification, governed by a margin condition and a local critical radius, and show how surrogate losses from full-feedback CLO (e.g., SPO+-type losses) adapt to partial feedback for scalable optimization. They also extend the framework to semi-bandit feedback and validate the approach with extensive synthetic and real-world (Uber Movement) experiments, demonstrating robustness of the end-to-end IERM approach under partial information and misspecification, particularly in the challenging bandit regime.

Abstract

Contextual linear optimization (CLO) uses predictive contextual features to reduce uncertainty in random cost coefficients in the objective and thereby improve decision-making performance. A canonical example is the stochastic shortest path problem with random edge costs (e.g., travel time) and contextual features (e.g., lagged traffic, weather). While existing work on CLO assumes fully observed cost coefficient vectors, in many applications the decision maker observes only partial feedback corresponding to each chosen decision in the history. In this paper, we study both a bandit-feedback setting (e.g., only the overall travel time of each historical path is observed) and a semi-bandit-feedback setting (e.g., travel times of the individual segments on each chosen path are additionally observed). We propose a unified class of offline learning algorithms for CLO with different types of feedback, following a powerful induced empirical risk minimization (IERM) framework that integrates estimation and optimization. We provide a novel fast-rate regret bound for IERM that allows for misspecified model classes and flexible choices of estimation methods. To solve the partial-feedback IERM, we also tailor computationally tractable surrogate losses. A byproduct of our theory of independent interest is the fast-rate regret bound for IERM with full feedback and a misspecified policy class. We compare the performance of different methods numerically using stochastic shortest path examples on simulated and real data and provide practical insights from the empirical results.

Paper Structure

This paper contains 56 sections, 18 theorems, 129 equations, 1 figure, and 13 tables.

Key Result

Lemma 1

Under the data-generating-process assumption (assump: dgp), the choices of score function $\theta(X,Z,C; f, \Sigma)$ listed in the lemma, when evaluated at the true nuisance functions $f = \tilde{f}^* \in \mathcal{F}^*$ and $\Sigma=\Sigma^*$, all satisfy the policy evaluation identity (eq: theta condition). Here, $\Sigma^\dagger(X)$ denotes the Moore–Penrose pseudo-inverse of the matrix $\Sigma(X)$.
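To give some intuition for why a pseudo-inverse appears in score functions of this kind, here is a minimal numerical sketch. It is not taken from the lemma: it assumes, purely for illustration, an inverse-weighting-style score of the form $\Sigma^\dagger Z C$ with $\Sigma = \mathbb{E}[Z Z^\top]$ and scalar bandit feedback $C = Z^\top Y$, and it drops the dependence on $X$. The decision set, probabilities, and mean cost vector `f` are all hypothetical. The logged decisions span only a two-dimensional subspace, so $\Sigma$ is singular and an ordinary inverse does not exist.

```python
import numpy as np

# Hypothetical logged decisions: three vectors in R^3 that span only 2 dimensions
# (the third is the sum of the first two), so E[Z Z^T] is singular.
decisions = np.array([[1.0, 1.0, 0.0],
                      [0.0, 1.0, 1.0],
                      [1.0, 2.0, 1.0]])
probs = np.array([0.5, 0.3, 0.2])   # hypothetical logging probabilities
f = np.array([2.0, -1.0, 0.5])      # hypothetical mean cost vector E[Y]

# Sigma = E[Z Z^T] under the logging distribution.
Sigma = sum(p * np.outer(z, z) for p, z in zip(probs, decisions))
Sigma_pinv = np.linalg.pinv(Sigma)  # Moore-Penrose pseudo-inverse
rank = np.linalg.matrix_rank(Sigma)

# Exact expectation of the assumed score Sigma^+ Z C, using E[C | Z] = Z^T f.
expected_score = sum(p * Sigma_pinv @ z * (z @ f)
                     for p, z in zip(probs, decisions))

# The expected score need not equal f, but it prices every logged decision correctly.
for z in decisions:
    print(z, z @ expected_score, z @ f)
```

In this sketch the expected score equals the projection of `f` onto the span of the logged decisions, which is why it reproduces the expected cost $z^\top f$ for every decision in that span even though `f` itself is not recovered.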

Figures (1)

  • Figure 1: Illustrations of full-feedback, bandit-feedback, and semi-bandit-feedback settings for the stochastic shortest path problem on a $5 \times 5$ grid. Highlighted edges indicate the information observed under each feedback model. In the full-feedback setting, the travel times of all $40$ edges in the road network are observed for each data point. In the bandit-feedback setting, only the scalar total travel time of the chosen path is observed. In the semi-bandit-feedback setting, the travel time of each edge along the chosen path is observed. The synthetic experiments (sec: synthetic) are conducted on this road network.
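
The three feedback models in the caption can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `grid_edges` and `path_indicator`, the uniform travel times, and the particular path are assumptions made for the example, not part of the paper's experimental setup.

```python
import random

def grid_edges(n):
    """Edges of an n x n grid of nodes, each stored as ((r, c), (r2, c2))."""
    edges = []
    for r in range(n):
        for c in range(n):
            if c + 1 < n:
                edges.append(((r, c), (r, c + 1)))  # horizontal edge
            if r + 1 < n:
                edges.append(((r, c), (r + 1, c)))  # vertical edge
    return edges

def path_indicator(path_nodes, edges):
    """0/1 vector marking which edges the path traverses."""
    index = {e: i for i, e in enumerate(edges)}
    z = [0] * len(edges)
    for u, v in zip(path_nodes[:-1], path_nodes[1:]):
        key = (u, v) if (u, v) in index else (v, u)
        z[index[key]] = 1
    return z

edges = grid_edges(5)
random.seed(0)
# Hypothetical travel time for every edge in the network.
travel_times = [random.uniform(1.0, 2.0) for _ in edges]

# One corner-to-corner path: along the first row, then down the last column.
path = [(0, c) for c in range(5)] + [(r, 4) for r in range(1, 5)]
z = path_indicator(path, edges)

# Full feedback: the travel time of every edge is observed.
full_feedback = list(travel_times)
# Bandit feedback: only the total travel time of the chosen path is observed.
bandit_feedback = sum(zi * yi for zi, yi in zip(z, travel_times))
# Semi-bandit feedback: the travel time of each edge on the chosen path is observed.
semi_bandit_feedback = {edges[i]: travel_times[i]
                        for i, zi in enumerate(z) if zi == 1}

print(len(edges), sum(z))  # 40 8
```

The edge count matches the $40$ edges stated in the caption, and the bandit observation is the sum of the semi-bandit observations along the same path.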

Theorems & Definitions (25)

  • Lemma 1
  • Remark 1: Estimation of $\tilde{f}^*$
  • Remark 2: Estimation of $\Sigma^*$
  • Theorem 1
  • Remark 3
  • Definition 1: VC-Subgraph Dimension
  • Proposition 1
  • Definition 2: Natarajan Dimension
  • Proposition 2
  • Proposition 3
  • ...and 15 more