POTEC: Off-Policy Learning for Large Action Spaces via Two-Stage Policy Decomposition

Yuta Saito, Jihan Yao, Thorsten Joachims

TL;DR

The paper tackles off-policy learning for contextual bandits in large discrete action spaces, where standard policy-gradient and reward-regression methods struggle due to high variance or bias. It introduces POTEC, a two-stage policy decomposition that first selects promising action clusters with a policy-based gradient estimator and then chooses the exact action within a cluster using a regression-based second stage, with a gradient estimator defined by $\nabla_{\theta} \widehat{V}_{\mathrm{POTEC}} (\pi_{\theta,\psi}^{overall}; \mathcal{D}) = \frac{1}{n} \sum_{i=1}^n \{ w(x_i,c_{a_i}) (r_i - \hat{f}(x_i,a_i)) s_{\theta}(x_i,c_{a_i}) + \mathbb{E}_{\pi_{\theta}^{1st}(c|x_i)} [ \hat{f}^{\pi_{\psi}^{2nd}} (x_i,c) s_{\theta}(x_i,c) ] \}$, where $\pi_{\theta,\psi}^{overall}(a|x) = \sum_{c} \pi_{\theta}^{1st}(c|x) \pi_{\psi}^{2nd}(a|x,c)$. The framework uses a two-step regression setup to minimize bias and variance in the gradient and establishes local correctness as a condition ensuring unbiasedness. Empirically, POTEC delivers substantial improvements over regression-based and policy-based baselines on synthetic data with known clusters and on real-world extreme classification datasets, demonstrating robustness to clustering quality and reward-noise and offering a scalable solution for large action spaces.
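The gradient estimator in the TL;DR can be sketched numerically. The NumPy code below is a minimal illustration under several assumptions that the summary does not state: the first-stage policy is a linear softmax over clusters, $w(x,c)$ is the cluster-level importance weight $\pi_{\theta}^{1st}(c|x) / \pi_0(c|x)$, $s_{\theta}(x,c)$ is the score function $\nabla_{\theta} \log \pi_{\theta}^{1st}(c|x)$, and $\hat{f}^{\pi_{\psi}^{2nd}}(x,c)$ is evaluated with a greedy second stage (the largest $\hat{f}$ inside the cluster). The function name `potec_gradient`, its arguments, and the synthetic data are hypothetical, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    # Row-wise softmax with the usual max-shift for numerical stability.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def potec_gradient(theta, X, a, r, cluster_of, pi0_cluster, f_hat, f_hat_cluster):
    """Sketch of the POTEC gradient for a linear-softmax first-stage policy.

    theta:         (d, K) first-stage parameters (assumed parameterization)
    X, a, r:       logged contexts (n, d), actions (n,), rewards (n,)
    cluster_of:    (n_actions,) cluster index c_a of every action
    pi0_cluster:   (n, K) logging-policy probability of each cluster
    f_hat:         (n, n_actions) regression estimates f_hat(x_i, a)
    f_hat_cluster: (n, K) f_hat under the second-stage policy, per cluster
    """
    n = X.shape[0]
    idx = np.arange(n)
    pi1 = softmax(X @ theta)                       # pi_theta^{1st}(c | x_i)
    c_obs = cluster_of[a]                          # cluster of the logged action
    w = pi1[idx, c_obs] / pi0_cluster[idx, c_obs]  # cluster importance weight
    onehot = np.eye(theta.shape[1])[c_obs]
    residual = r - f_hat[idx, a]
    # Score of a linear softmax: grad log pi(c|x) = outer(x, e_c - pi(.|x)).
    is_term = X.T @ ((w * residual)[:, None] * (onehot - pi1))
    # E_{c ~ pi1}[f_hat_cluster(x, c) * s(x, c)] in closed form.
    baseline = (pi1 * f_hat_cluster).sum(axis=1, keepdims=True)
    dm_term = X.T @ (pi1 * (f_hat_cluster - baseline))
    return (is_term + dm_term) / n

# Synthetic logged data (all values are made up for illustration).
rng = np.random.default_rng(0)
n, d, n_actions, K = 200, 3, 12, 4
cluster_of = np.arange(n_actions) % K
X = rng.normal(size=(n, d))
pi0 = np.full((n, n_actions), 1.0 / n_actions)    # uniform logging policy
a = rng.integers(n_actions, size=n)
f_hat = rng.normal(size=(n, n_actions))
r = f_hat[np.arange(n), a] + rng.normal(scale=0.5, size=n)
pi0_cluster = np.stack([pi0[:, cluster_of == c].sum(axis=1) for c in range(K)], axis=1)
f_hat_cluster = np.stack([f_hat[:, cluster_of == c].max(axis=1) for c in range(K)], axis=1)

theta = np.zeros((d, K))
grad = potec_gradient(theta, X, a, r, cluster_of, pi0_cluster, f_hat, f_hat_cluster)
print(grad.shape)
```

Importance weights appear only at the cluster level here, which is where the variance reduction claimed in the TL;DR would come from: the weight depends on the number of clusters rather than the number of actions.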

Abstract

We study off-policy learning (OPL) of contextual bandit policies in large discrete action spaces where existing methods -- most of which rely crucially on reward-regression models or importance-weighted policy gradients -- fail due to excessive bias or variance. To overcome these issues in OPL, we propose a novel two-stage algorithm, called Policy Optimization via Two-Stage Policy Decomposition (POTEC). It leverages clustering in the action space and learns two different policies via policy- and regression-based approaches, respectively. In particular, we derive a novel low-variance gradient estimator that enables efficient learning of a first-stage policy for cluster selection via a policy-based approach. To select a specific action within the cluster sampled by the first-stage policy, POTEC uses a second-stage policy derived from a regression-based approach within each cluster. We show that a local correctness condition, which only requires that the regression model preserves the relative expected reward differences of the actions within each cluster, ensures that our policy-gradient estimator is unbiased and the second-stage policy is optimal. We also show that POTEC provides a strict generalization of policy- and regression-based approaches and their associated assumptions. Comprehensive experiments demonstrate that POTEC provides substantial improvements in OPL effectiveness, particularly in large and structured action spaces.
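The role of local correctness in the second stage can be illustrated with a toy example. The sketch below assumes a single context, six actions in two clusters, and a greedy second-stage policy. The regression model `f_hat` equals the true expected reward `q` plus an arbitrary cluster-specific offset, so it preserves reward differences within each cluster while being badly wrong across clusters. All names and numbers are hypothetical.

```python
import numpy as np

def greedy_second_stage(f_hat, cluster_of, n_clusters):
    # pi2[c, a] = 1 if a maximizes f_hat among the actions of cluster c.
    pi2 = np.zeros((n_clusters, f_hat.shape[0]))
    for c in range(n_clusters):
        members = np.flatnonzero(cluster_of == c)
        pi2[c, members[np.argmax(f_hat[members])]] = 1.0
    return pi2

def overall_policy(pi1, pi2):
    # pi_overall(a|x) = sum_c pi1(c|x) * pi2(a|x,c) for one context x.
    return pi1 @ pi2

cluster_of = np.array([0, 0, 0, 1, 1, 1])
q = np.array([0.1, 0.5, 0.3, 0.9, 0.2, 0.4])  # true expected rewards (toy)
offsets = np.array([5.0, -2.0])               # arbitrary per-cluster error
f_hat = q + offsets[cluster_of]               # locally correct, globally wrong

pi2_from_q = greedy_second_stage(q, cluster_of, 2)
pi2_from_f = greedy_second_stage(f_hat, cluster_of, 2)
print(np.array_equal(pi2_from_q, pi2_from_f))  # True: same within-cluster choice

pi1 = np.array([0.25, 0.75])                   # some first-stage policy
print(overall_policy(pi1, pi2_from_f))         # mass on actions 1 and 3 only

# A purely regression-based greedy policy would be misled by the offsets.
print(int(np.argmax(f_hat)), int(np.argmax(q)))  # 1 3
```

In this toy case the within-cluster choice is unaffected by the offsets, while picking the global maximum of `f_hat` selects action 1 instead of the truly best action 3. Choosing between clusters is left to the first-stage policy, which does not rely on `f_hat` being accurate across clusters.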

Paper Structure

This paper contains 34 sections, 3 theorems, 46 equations, 12 figures, 4 tables.

Key Result

Theorem 3.2

(Bias Analysis) If the full cluster support condition holds, the POTEC gradient estimator has the following bias for some given regression model $\hat{f}(x,a)$, where $a,b \in \mathcal{A}$.

Figures (12)

  • Figure 1: The Two-Stage Off-Policy Learning Procedure of Our POTEC Algorithm, which first forms action clustering $c_a$, and then identifies a promising cluster by the 1st-stage policy $\pi_{\theta}^{1st}$, and finally picks the best action in the cluster by the 2nd-stage policy $\pi_{\psi}^{2nd}$.
  • Figure 2: The POTEC algorithm and local correctness condition generalize policy- and regression-based approaches and their respective conditions about the reward function ($q(x,a)$) estimation.
  • Figure 3: Comparing the test policy value (normalized by $V(\pi_0)$) of the OPL methods, with varying (i) training data sizes, (ii) numbers of actions, and (iii) numbers of (true) clusters, in the synthetic experiment.
  • Figure 4: Comparing the test policy value (normalized by $V(\pi_0)$) of the OPL methods under varying cluster noise ratios.
  • Figure 5: Comparing the test policy value (normalized by $V(\pi_0)$) of the OPL methods under varying accuracies of $\hat{q}$ and $\hat{f}$.
  • ...and 7 more figures

Theorems & Definitions (5)

  • Theorem 3.2
  • Corollary 3.4
  • Proposition 3.5
  • proof
  • proof