POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition
Yuta Saito, Jihan Yao, Thorsten Joachims
TL;DR
The paper tackles off-policy learning for contextual bandits in large discrete action spaces, where standard policy-gradient and reward-regression methods struggle due to high variance or bias. It introduces POTEC, a two-stage policy decomposition that first selects promising action clusters with a policy-based gradient estimator and then chooses the exact action within a cluster using a regression-based second stage, with a gradient estimator defined by $\nabla_{\theta} \widehat{V}_{\mathrm{POTEC}} (\pi_{\theta,\psi}^{\mathrm{overall}}; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^n \{ w(x_i,c_{a_i}) (r_i - \hat{f}(x_i,a_i)) s_{\theta}(x_i,c_{a_i}) + \mathbb{E}_{\pi_{\theta}^{\mathrm{1st}}(c|x_i)} [ \hat{f}^{\pi_{\psi}^{\mathrm{2nd}}} (x_i,c) s_{\theta}(x_i,c) ] \}$, where $\pi_{\theta,\psi}^{\mathrm{overall}}(a|x) = \sum_{c} \pi_{\theta}^{\mathrm{1st}}(c|x) \pi_{\psi}^{\mathrm{2nd}}(a|x,c)$. The framework uses a two-step regression setup to minimize bias and variance in the gradient and establishes local correctness as a condition ensuring unbiasedness. Empirically, POTEC delivers substantial improvements over regression-based and policy-based baselines on synthetic data with known clusters and on real-world extreme classification datasets, demonstrating robustness to clustering quality and reward noise and offering a scalable solution for large action spaces.
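To make the estimator above concrete, here is a minimal NumPy sketch, not the authors' implementation. It rests on several assumptions that the summary does not spell out: the first-stage policy is a linear softmax over clusters, $w(x,c)$ is the ratio of the first-stage policy to the logging policy's cluster probability, $s_{\theta}(x,c)$ is the score function $\nabla_{\theta} \log \pi_{\theta}^{\mathrm{1st}}(c|x)$, and the second-stage policy is greedy, so that $\hat{f}^{\pi_{\psi}^{\mathrm{2nd}}}(x,c)$ reduces to the largest predicted reward inside cluster $c$. The names `potec_gradient`, `pi0_cluster`, and `cluster_of` are hypothetical.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with max-subtraction for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def potec_gradient(theta, x, a, r, f_hat, pi0_cluster, cluster_of):
    """Sketch of the POTEC-style gradient for a linear-softmax first stage.

    theta       : (d, K) first-stage parameters (assumed parametrization)
    x           : (n, d) contexts
    a           : (n,)   logged action indices
    r           : (n,)   logged rewards
    f_hat       : (n, A) regression estimates f_hat(x_i, a) for every action
    pi0_cluster : (n, K) logging-policy cluster probabilities (assumed known)
    cluster_of  : (A,)   cluster index of each action (every cluster non-empty)
    """
    n, _ = x.shape
    K = theta.shape[1]
    idx = np.arange(n)
    pi1 = softmax(x @ theta)          # pi^1st_theta(c | x_i), shape (n, K)
    c_obs = cluster_of[a]             # cluster of each logged action

    # Policy-based term: w(x, c_a) * (r - f_hat(x, a)) * s_theta(x, c_a).
    # For a linear softmax, s_theta(x, c) = x (outer) (onehot(c) - pi1(.|x)).
    w = pi1[idx, c_obs] / pi0_cluster[idx, c_obs]
    onehot = np.zeros((n, K))
    onehot[idx, c_obs] = 1.0
    coef_policy = (w * (r - f_hat[idx, a]))[:, None] * (onehot - pi1)

    # Regression-based term: E_{pi1(c|x)}[ f_hat^{2nd}(x, c) * s_theta(x, c) ],
    # with a greedy second stage (assumption): best predicted reward per cluster.
    f_cluster = np.stack(
        [f_hat[:, cluster_of == c].max(axis=1) for c in range(K)], axis=1
    )
    baseline = (pi1 * f_cluster).sum(axis=1, keepdims=True)
    coef_regression = pi1 * (f_cluster - baseline)

    # Average over the n logged samples; result has the shape of theta.
    return x.T @ (coef_policy + coef_regression) / n
```

Under these assumptions the returned array can be used directly in a gradient-ascent step on `theta`. The expectation over clusters is computed in closed form, which is cheap when the number of clusters is much smaller than the number of actions.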
Abstract
We study off-policy learning (OPL) of contextual bandit policies in large discrete action spaces where existing methods -- most of which rely crucially on reward-regression models or importance-weighted policy gradients -- fail due to excessive bias or variance. To overcome these issues in OPL, we propose a novel two-stage algorithm, called Policy Optimization via Two-Stage Policy Decomposition (POTEC). It leverages clustering in the action space and learns two different policies via policy- and regression-based approaches, respectively. In particular, we derive a novel low-variance gradient estimator that makes it possible to learn a first-stage policy for cluster selection efficiently via a policy-based approach. To select a specific action within the cluster sampled by the first-stage policy, POTEC uses a second-stage policy derived from a regression-based approach within each cluster. We show that a local correctness condition, which only requires that the regression model preserves the relative expected reward differences of the actions within each cluster, ensures that our policy-gradient estimator is unbiased and the second-stage policy is optimal. We also show that POTEC provides a strict generalization of policy- and regression-based approaches and their associated assumptions. Comprehensive experiments demonstrate that POTEC provides substantial improvements in OPL effectiveness, particularly in large and structured action spaces.
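The local correctness condition can be illustrated with a small, entirely made-up example. The sketch below assumes the second-stage policy greedily picks the action with the highest predicted reward inside the sampled cluster; the reward values, the per-cluster offsets, and the helper `second_stage_greedy` are hypothetical. The regression model is wrong by a different constant in each cluster, so it preserves reward differences within a cluster but not across clusters.

```python
import numpy as np

def second_stage_greedy(scores, cluster_of, c):
    # Pick the action in cluster c with the highest score (assumed greedy rule).
    members = np.flatnonzero(cluster_of == c)
    return int(members[np.argmax(scores[members])])

# Hypothetical true expected rewards q(x, a) for one context and six actions.
q = np.array([0.1, 0.5, 0.3, 0.9, 0.2, 0.4])
cluster_of = np.array([0, 0, 0, 1, 1, 1])

# A regression model that is off by a cluster-specific constant:
# within-cluster differences are preserved, absolute values are not.
offset = np.array([5.0, -3.0])
f_hat = q + offset[cluster_of]

best_true = [second_stage_greedy(q, cluster_of, c) for c in (0, 1)]
best_hat = [second_stage_greedy(f_hat, cluster_of, c) for c in (0, 1)]

print(best_true, best_hat)                          # within-cluster choices agree
print(int(np.argmax(q)), int(np.argmax(f_hat)))     # global argmax disagrees
```

In this toy case the within-cluster choices made from `f_hat` match those made from the true rewards, while a global argmax over `f_hat` picks action 1 instead of the truly best action 3. That gap is why the cluster choice is left to the policy-based first stage.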
