Contextual Linear Optimization with Partial Feedback
Yichun Hu, Nathan Kallus, Xiaojie Mao, Yanchen Wu
TL;DR
This work tackles contextual linear optimization under partial feedback, focusing on bandit and semi-bandit settings where only partial cost information is observed. It introduces a unified Induced Empirical Risk Minimization (IERM) framework that uses score functions to evaluate expected policy cost from partial data and employs cross-fitting to estimate nuisance quantities. The authors establish fast-rate regret bounds under model misspecification, governed by a margin condition and a local critical radius, and show how surrogate losses from full-feedback CLO (e.g., SPO+-type losses) can be adapted to partial feedback for scalable optimization. They also extend the framework to semi-bandit feedback and validate the approach with extensive synthetic and real-world (Uber Movement) experiments. The experiments demonstrate that the end-to-end IERM approach is robust to partial information and misspecification, particularly in the challenging bandit regime.
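The score-function idea can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes a small finite menu of candidate paths labeled 0..K-1 (each appearing in the log), bandit feedback consisting only of the chosen path's total cost, a linear outcome model per path, and logging propensities that do not depend on the context. Under those assumptions, the standard doubly robust score (a direct-method prediction plus an inverse-propensity correction) estimates a policy's expected cost, and cross-fitting fits the nuisance models on the other folds before scoring each fold. The function name dr_policy_cost and its arguments are hypothetical.

```python
import numpy as np

def dr_policy_cost(X, Z, C, policy, n_folds=2, seed=0):
    """Cross-fitted doubly robust estimate of a policy's expected cost.

    Hypothetical helper for illustration. X: (n, p) contexts; Z: (n,) index
    of the path chosen in the log; C: (n,) observed total cost of that path
    (bandit feedback); policy: maps an (m, p) context array to (m,) path indices.
    """
    n = len(C)
    K = int(Z.max()) + 1  # assumes paths are labeled 0..K-1 and all appear in the log
    rng = np.random.default_rng(seed)
    folds = rng.permutation(n) % n_folds  # balanced random fold labels
    A = np.column_stack([np.ones(n), X])  # design matrix with an intercept
    scores = np.empty(n)
    for k in range(n_folds):
        train, test = folds != k, folds == k
        # Nuisance 1 (fit off-fold): per-path linear regression of cost on context.
        coefs = [
            np.linalg.lstsq(A[train & (Z == j)], C[train & (Z == j)], rcond=None)[0]
            for j in range(K)
        ]
        # Nuisance 2 (fit off-fold): logging propensities, assumed context-independent.
        prop = np.bincount(Z[train], minlength=K) / train.sum()
        pi = policy(X[test])
        mu = np.column_stack([A[test] @ beta for beta in coefs])  # (m, K) predicted costs
        rows = np.arange(test.sum())
        direct = mu[rows, pi]  # predicted cost of the policy's path
        match = Z[test] == pi  # the log is informative only where actions agree
        correction = match * (C[test] - mu[rows, Z[test]]) / prop[pi]
        scores[test] = direct + correction
    return scores.mean()
```

Under the assumed context-independent logging, the estimated propensities are correct, so the correction term keeps the estimate consistent even if the linear outcome model is misspecified; cross-fitting ensures that no observation is scored with nuisance models fit on that same observation.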
Abstract
Contextual linear optimization (CLO) uses predictive contextual features to reduce uncertainty in random cost coefficients in the objective and thereby improve decision-making performance. A canonical example is the stochastic shortest path problem with random edge costs (e.g., travel time) and contextual features (e.g., lagged traffic, weather). While existing work on CLO assumes fully observed cost coefficient vectors, in many applications the decision maker observes only partial feedback corresponding to each chosen decision in the history. In this paper, we study both a bandit-feedback setting (e.g., only the overall travel time of each historical path is observed) and a semi-bandit-feedback setting (e.g., travel times of the individual segments on each chosen path are additionally observed). We propose a unified class of offline learning algorithms for CLO with different types of feedback, following a powerful induced empirical risk minimization (IERM) framework that integrates estimation and optimization. We provide a novel fast-rate regret bound for IERM that allows for misspecified model classes and flexible choices of estimation methods. To solve the partial-feedback IERM, we also tailor computationally tractable surrogate losses. A byproduct of our theory, of independent interest, is a fast-rate regret bound for IERM with full feedback and a misspecified policy class. We compare the performance of different methods numerically using stochastic shortest path examples on simulated and real data and provide practical insights from the empirical results.
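To make the distinction between feedback types concrete, here is a small sketch on an assumed 3x3 grid graph (a hypothetical example, not the paper's experimental network). The six monotone paths are enumerated as 0/1 edge-incidence vectors. The sketch shows the decision induced by a cost prediction, what a log records under full, semi-bandit, and bandit feedback, and the standard full-feedback SPO+ surrogate of Elmachtoub and Grigas, which needs the whole cost vector. The paper's partial-feedback surrogates are not reproduced here, and all names are hypothetical.

```python
import itertools
import numpy as np

GRID = 3  # hypothetical 3x3 node grid; paths go from (0, 0) to (2, 2)

# Index every edge: right moves (r, c)->(r, c+1) and down moves (r, c)->(r+1, c).
EDGES = [("R", r, c) for r in range(GRID) for c in range(GRID - 1)] + \
        [("D", r, c) for r in range(GRID - 1) for c in range(GRID)]
EDGE_ID = {e: i for i, e in enumerate(EDGES)}

def enumerate_paths():
    """Return a (num_paths, num_edges) 0/1 matrix of all monotone paths."""
    steps = 2 * (GRID - 1)
    paths = []
    for downs in itertools.combinations(range(steps), GRID - 1):
        z = np.zeros(len(EDGES))
        r = c = 0
        for s in range(steps):
            if s in downs:
                z[EDGE_ID[("D", r, c)]] = 1
                r += 1
            else:
                z[EDGE_ID[("R", r, c)]] = 1
                c += 1
        paths.append(z)
    return np.array(paths)

PATHS = enumerate_paths()

def induced_decision(cost_pred):
    """Decision induced by a cost prediction: the path minimizing predicted cost."""
    return PATHS[np.argmin(PATHS @ cost_pred)]

def observe(y, z, feedback):
    """What the log records for realized edge costs y when path z was taken."""
    if feedback == "full":
        return y.copy()  # the whole cost vector
    if feedback == "semi-bandit":
        return np.where(z == 1, y, np.nan)  # only edges on the chosen path
    if feedback == "bandit":
        return float(y @ z)  # only the total cost of the chosen path
    raise ValueError(f"unknown feedback type: {feedback}")

def spo_plus(c_hat, c):
    """Full-feedback SPO+ loss for min_z c^T z over PATHS (requires the full c)."""
    z_star = induced_decision(c)
    return np.max(PATHS @ (c - 2 * c_hat)) + 2 * c_hat @ z_star - c @ z_star
```

Enumerating paths only works for tiny graphs. For realistic networks, the argmin and the max in SPO+ would instead be computed with a shortest-path or linear-programming solver, since both are linear objectives over the path polytope.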
