Bayesian Off-Policy Evaluation and Learning for Large Action Spaces

Imad Aouali, Victor-Emmanuel Brunel, David Rohde, Anna Korba

TL;DR

This work tackles off-policy evaluation and learning in contextual bandits with very large action spaces by introducing sDM, a Bayesian structured direct method that exploits a latent variable $\boldsymbol{\rho}$ to share reward information across actions. In the linear-Gaussian setting, sDM yields closed-form posterior updates for $\theta_a$ and the latent $\boldsymbol{\rho}$, enabling scalable inference and a simple greedy learning rule. The authors formalize Bayesian evaluation metrics, including Bayesian suboptimality (BSO) and Bayesian mean squared error (BMSE), and prove a covariance-dependent bound showing when greedy policies minimize BSO. Empirically, sDM outperforms baselines on synthetic data and real-world datasets (MovieLens, KuaiRec), even under likelihood misspecification, demonstrating data efficiency and robustness in large action spaces. They also discuss the practical limitations of assuming well-specified priors and outline directions for extending the framework to non-linear hierarchies and richer priors.
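
To make the idea of sharing reward information through a latent variable concrete, the following is a minimal sketch in a deliberately simplified scalar setting. It is not the authors' implementation. The assumptions are all illustrative: features and parameters are scalars, the latent acts as a shared prior mean for every action parameter, all variances are known, and the names `sdm_scalar_posterior` and `greedy_action` are hypothetical. Under those assumptions the latent posterior, the per-action posterior, and the greedy rule all have closed forms.

```python
from collections import defaultdict


def sdm_scalar_posterior(logs, actions, mu=0.0, var_latent=1.0,
                         var_a=0.25, noise_var=1.0):
    """Closed-form posteriors for an ASSUMED scalar hierarchy (illustrative only):
         latent    ~ N(mu, var_latent)
         theta_a   ~ N(latent, var_a)        for every action a
         reward    ~ N(x * theta_a, noise_var)
    `logs` is an iterable of (x, a, r) tuples collected by some logging policy.
    """
    # Per-action sufficient statistics: G_a = sum x^2 / s^2, B_a = sum x r / s^2.
    G = defaultdict(float)
    B = defaultdict(float)
    for x, a, r in logs:
        G[a] += x * x / noise_var
        B[a] += x * r / noise_var

    # Posterior of the latent after integrating out every theta_a.
    # Each action contributes precision G_a / (1 + var_a * G_a); unseen
    # actions (G_a = 0) contribute nothing.
    prec_latent = 1.0 / var_latent + sum(
        G[a] / (1.0 + var_a * G[a]) for a in actions)
    mean_latent = (mu / var_latent + sum(
        B[a] / (1.0 + var_a * G[a]) for a in actions)) / prec_latent
    var_latent_post = 1.0 / prec_latent

    # Marginal posterior of each theta_a: condition on the latent, then
    # propagate the latent's posterior mean and variance.
    post = {}
    for a in actions:
        prec = 1.0 / var_a + G[a]
        w = (1.0 / var_a) / prec           # weight placed on the latent
        mean = w * mean_latent + B[a] / prec
        var = 1.0 / prec + w * w * var_latent_post
        post[a] = (mean, var)
    return (mean_latent, var_latent_post), post


def greedy_action(x, post):
    """Pick the action with the highest posterior-mean reward for context x."""
    return max(post, key=lambda a: x * post[a][0])


if __name__ == "__main__":
    # Action 2 never appears in the logs, yet its estimate still moves.
    logs = [(1.0, 0, 1.0), (1.0, 1, 1.0)]
    latent, post = sdm_scalar_posterior(logs, actions=[0, 1, 2])
    print("latent posterior:", latent)
    print("per-action posteriors:", post)
    print("greedy action for x=1:", greedy_action(1.0, post))
```

In this toy run the unseen action inherits the latent's posterior mean, so observations of other actions shift its reward estimate away from the prior. This is the kind of information sharing the summary describes, shown here only for the simplified model assumed above.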

Abstract

In interactive systems, actions are often correlated, presenting an opportunity for more sample-efficient off-policy evaluation (OPE) and learning (OPL) in large action spaces. We introduce a unified Bayesian framework to capture these correlations through structured and informative priors. In this framework, we propose sDM, a generic Bayesian approach for OPE and OPL, grounded in both algorithmic and theoretical foundations. Notably, sDM leverages action correlations without compromising computational efficiency. Moreover, inspired by online Bayesian bandits, we introduce Bayesian metrics that assess the average performance of algorithms across multiple problem instances, deviating from the conventional worst-case assessments. We analyze sDM in OPE and OPL, highlighting the benefits of leveraging action correlations. Empirical evidence showcases the strong performance of sDM.

Paper Structure

This paper contains 43 sections, 7 theorems, 107 equations, 16 figures.

Key Result

Theorem 5.1

Let $\pi_*(x)$ be the optimal action for context $x$. Then the BSO of $\texttt{sDM}$ under the structured prior \ref{eq:contextual_gaussian_model} satisfies a covariance-dependent bound.

Figures (16)

  • Figure 1: Graph representation of the structured prior.
  • Figure 2: Performance of $\texttt{sDM}$ and baselines on synthetic and MovieLens problems.
  • Figure 3: $\texttt{sDM}$ vs. DM (Bayes) for varying $K$.
  • Figure 4: Effect of misspecification.
  • Figure 5: OPL results on KuaiRec.
  • ...and 11 more figures

Theorems & Definitions (16)

  • Remark 4.1
  • Theorem 5.1: Covariance-Dependent Bound
  • Theorem 5.2: Scaling with $n$
  • Theorem 5.3: OPE Result
  • proof: Derivation of $p(\theta_a \mid S)$ for the standard prior in \ref{eq:basic_model}
  • proof: Derivation of $p(\psi \mid S)$
  • proof: Derivation of $p(\theta_a \mid \psi, S)$
  • proof: Derivation of $p(\theta_a \mid S)$
  • Lemma E.1: Bayesian bound
  • proof
  • ...and 6 more