Unified PAC-Bayesian Study of Pessimism for Offline Policy Learning with Regularized Importance Sampling

Imad Aouali, Victor-Emmanuel Brunel, David Rohde, Anna Korba

TL;DR

This work studies pessimism in offline policy learning (OPL) through a unified PAC-Bayesian framework that applies to a broad family of regularized importance weights (IW). It derives a tractable two-sided generalization bound for regularized inverse propensity scoring (IPS) and proposes two learning principles, Bound Optimization and Heuristic Optimization, that are compatible with both linear and non-linear IW regularizations. Experiments on MNIST and Fashion-MNIST complement the theory: standard IW regularizations (Clip, IX, ES) perform well in OPL, and the proposed PAC-Bayesian approach matches or surpasses existing baselines across logging policies of varying quality. Overall, the study provides a common framework for comparing pessimistic learning strategies in offline policy learning, with practical guidance on choosing IW regularizations and optimization schemes.

Abstract

Off-policy learning (OPL) often involves minimizing a risk estimator based on importance weighting to correct bias from the logging policy used to collect data. However, this method can produce an estimator with a high variance. A common solution is to regularize the importance weights and learn the policy by minimizing an estimator with penalties derived from generalization bounds specific to the estimator. This approach, known as pessimism, has gained recent attention but lacks a unified framework for analysis. To address this gap, we introduce a comprehensive PAC-Bayesian framework to examine pessimism with regularized importance weighting. We derive a tractable PAC-Bayesian generalization bound that universally applies to common importance weight regularizations, enabling their comparison within a single framework. Our empirical results challenge common understanding, demonstrating the effectiveness of standard IW regularization techniques.
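To make the idea of regularized importance weighting concrete, here is a minimal NumPy sketch of the four regularizations named on this page (Clip, Har, IX, ES). The exact formulas are not given in the text above, so the forms below are assumptions based on how these regularizations are commonly written; the function names, the cost convention (costs in $[-1, 0]$), and the toy propensities are hypothetical and chosen only for illustration.

```python
import numpy as np

def regularized_weights(pi, pi0, kind, tau):
    """Regularized importance weights (assumed forms, not taken from the paper text).

    pi  : target-policy propensities pi(a_i | x_i) of the logged actions
    pi0 : logging-policy propensities pi0(a_i | x_i) of the logged actions
    tau : regularization hyperparameter in [0, 1]
    """
    if kind == "clip":  # lower-bound the logging propensity by tau
        return pi / np.maximum(pi0, tau)
    if kind == "har":   # harmonic-style mix of pi0 and pi in the denominator
        return pi / ((1.0 - tau) * pi0 + tau * pi)
    if kind == "ix":    # implicit exploration: add tau to the denominator
        return pi / (pi0 + tau)
    if kind == "es":    # exponential smoothing: raise pi0 to the power tau
        return pi / pi0 ** tau
    raise ValueError(f"unknown regularization: {kind}")

def regularized_ips_risk(costs, pi, pi0, kind, tau):
    """Empirical risk: mean of regularized weight times logged cost."""
    return float(np.mean(regularized_weights(pi, pi0, kind, tau) * costs))

# Toy logged data (hypothetical values).
pi = np.array([0.9, 0.5, 0.2])
pi0 = np.array([0.1, 0.5, 0.4])
costs = np.array([-1.0, -1.0, 0.0])

for kind in ("clip", "har", "ix", "es"):
    print(kind, regularized_ips_risk(costs, pi, pi0, kind, tau=0.3))
```

Under these assumed forms, Clip, Har, and IX reduce to plain IPS at `tau=0`, while ES reduces to plain IPS at `tau=1`. Larger regularization shrinks the largest weights, which trades some bias for lower variance.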

Paper Structure

This paper contains 25 sections, 4 theorems, 63 equations, 6 figures, 2 algorithms.

Key Result

Theorem 1

Let $\lambda > 0$, $n \ge 1$, $\delta \in (0, 1)$, and let $\mathbb{P}$ be a fixed prior on $\Theta$. The following inequality holds with probability at least $1 - \delta$ for any distribution $\mathbb{Q}$ on $\Theta$: … where ${\textsc{kl}}_1(\mathbb{Q}) = D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P}) + \log \frac{4\sqrt{n}}{\delta}$ and ${\textsc{kl}}_2(\mathbb{Q}) = D_{\mathrm{KL}}(\mathbb{Q} \| \dots$
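As a small worked example of the complexity term ${\textsc{kl}}_1(\mathbb{Q}) = D_{\mathrm{KL}}(\mathbb{Q} \| \mathbb{P}) + \log \frac{4\sqrt{n}}{\delta}$ defined above, the sketch below evaluates it when both $\mathbb{Q}$ and $\mathbb{P}$ are isotropic Gaussians with the same standard deviation. That Gaussian choice, the function names, and the numeric values are assumptions made for illustration only; the theorem itself allows any $\mathbb{Q}$ and any fixed prior $\mathbb{P}$.

```python
import math
import numpy as np

def gaussian_kl(mu_q, mu_p, sigma):
    """KL( N(mu_q, sigma^2 I) || N(mu_p, sigma^2 I) ) = ||mu_q - mu_p||^2 / (2 sigma^2)."""
    diff = np.asarray(mu_q, dtype=float) - np.asarray(mu_p, dtype=float)
    return float(diff @ diff) / (2.0 * sigma ** 2)

def kl_1(kl_q_p, n, delta):
    """kl_1(Q) = D_KL(Q || P) + log(4 * sqrt(n) / delta), as stated in Theorem 1."""
    return kl_q_p + math.log(4.0 * math.sqrt(n) / delta)

# Hypothetical posterior and prior means for a 2-dimensional parameter.
kl = gaussian_kl(mu_q=[1.0, 1.0], mu_p=[0.0, 0.0], sigma=1.0)
print(kl_1(kl, n=10_000, delta=0.05))
```

The logarithmic term grows slowly in $n$ and $1/\delta$, so for large samples the divergence between $\mathbb{Q}$ and the prior is typically the part of ${\textsc{kl}}_1$ that the learner can actually control.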

Figures (6)

  • Figure 1: Performance of the learned policy with different PAC-Bayes pessimistic learning principles (our Corollary 2 and those in \cite{london2019bayesian,sakhi2022pac}) using the Clip IPS risk estimator in \ref{eq:regs}.
  • Figure 2: Performance of the policy learned by Bound Optimization (\ref{eq:objective_pac_bayes}) for different IW regularizations. The $x$-axis reflects the quality of the logging policy $\eta_0 \in [-0.5, 0.5]$. In the first four columns, we plot the reward of the learned policy using a fixed IW regularization technique (Clip, Har, IX, or ES as defined in \ref{eq:regs}) for various values of its hyperparameter within $[0,1]$. In the last column, we report the mean reward across these hyperparameter values.
  • Figure 3: Performance of the policy learned by Heuristic Optimization (\ref{eq:learning_principle}) for different IW regularizations. The $x$-axis reflects the quality of the logging policy $\eta_0 \in [-0.5, 0.5]$. In the first four columns, we plot the reward of the learned policy using a fixed IW regularization technique (Clip, Har, IX, or ES as defined in \ref{eq:regs}) for various values of its hyperparameter within $[0,1]$. In the last column, we report the mean reward across these hyperparameter values.
  • Figure 4: Performance of the policy learned by optimizing the bound in Corollary 2 for different IW regularizations. The $x$-axis reflects the quality of the logging policy $\eta_0 \in [-0.5, 0.5]$. In the first three columns, we plot the reward of the learned policy using a fixed IW regularization technique (Clip, IX, or ES as defined in \ref{eq:regs}) for various values of its hyperparameter within $[0,1]$. In the last column, we report the mean reward across these hyperparameter values.
  • Figure 5: Performance of the learned policy with two learning principles (our Heuristic Optimization (\ref{eq:learning_principle}) and the $L_2$ heuristic in \cite{london2019bayesian}, with varying values of their hyperparameters in a grid within $[10^{-5}, 10^{-3}]$) using the Clip IPS risk estimator in \ref{eq:regs} with fixed $\tau=1/\sqrt[4]{n}$.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Theorem 1
  • Corollary 2
  • Theorem 3: Theorem 1 Restated
  • Proof
  • Lemma 4