Unified PAC-Bayesian Study of Pessimism for Offline Policy Learning with Regularized Importance Sampling
Imad Aouali, Victor-Emmanuel Brunel, David Rohde, Anna Korba
TL;DR
This work studies pessimism in offline policy learning (OPL) through a unified PAC-Bayesian framework that applies to a broad family of regularized importance weights (IW). It derives a tractable two-sided generalization bound for regularized inverse propensity scoring (IPS) and proposes two learning principles, Bound Optimization and Heuristic Optimization, that are compatible with both linear and non-linear IW regularizations. Experiments on MNIST and Fashion-MNIST complement the theory: standard IW regularizations (Clip, IX, ES) perform well in OPL, and the proposed PAC-Bayesian approach matches or outperforms existing baselines across logging policies of varying quality. Overall, the study provides a generic framework for comparing pessimistic learning strategies in OPL, with practical guidance on choosing IW regularizations and optimization schemes.
Abstract
Off-policy learning (OPL) often involves minimizing a risk estimator based on importance weighting to correct bias from the logging policy used to collect data. However, this method can produce a high-variance estimator. A common solution is to regularize the importance weights and learn the policy by minimizing an estimator with penalties derived from generalization bounds specific to the estimator. This approach, known as pessimism, has gained recent attention but lacks a unified framework for analysis. To address this gap, we introduce a comprehensive PAC-Bayesian framework to examine pessimism with regularized importance weighting. We derive a tractable PAC-Bayesian generalization bound that universally applies to common importance weight (IW) regularizations, enabling their comparison within a single framework. Our empirical results challenge common understanding, demonstrating the effectiveness of standard IW regularization techniques.
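To make the idea of regularized importance weighting concrete, the following is a minimal sketch rather than the authors' implementation. It assumes logged costs (negative rewards) and uses commonly stated forms of three regularizations: clipping the logging propensity from below, adding a constant to it (implicit exploration), and raising it to a power in [0, 1] (exponential smoothing). The function names, the `kind` and `param` arguments, and the exact parameterizations are illustrative assumptions and may differ from those used in the paper.

```python
from typing import Sequence


def regularized_weight(pi: float, pi0: float, kind: str = "ips", param: float = 0.0) -> float:
    """Importance weight pi/pi0 under a chosen regularization (illustrative forms)."""
    if kind == "ips":   # unregularized importance weight
        return pi / pi0
    if kind == "clip":  # assumed form: floor the logging propensity at param (tau)
        return pi / max(pi0, param)
    if kind == "ix":    # assumed form: add param (gamma) to the logging propensity
        return pi / (pi0 + param)
    if kind == "es":    # assumed form: raise the logging propensity to the power param (alpha)
        return pi / pi0 ** param
    raise ValueError(f"unknown regularization: {kind}")


def regularized_ips_risk(
    costs: Sequence[float],
    pis: Sequence[float],
    pi0s: Sequence[float],
    kind: str = "ips",
    param: float = 0.0,
) -> float:
    """Average of weight * cost over the logged samples."""
    weights = [regularized_weight(p, p0, kind, param) for p, p0 in zip(pis, pi0s)]
    return sum(w * c for w, c in zip(weights, costs)) / len(costs)


if __name__ == "__main__":
    # Hypothetical logged data: costs, target-policy and logging-policy propensities.
    costs = [-1.0, 0.0, -1.0]
    pis = [0.5, 0.2, 0.9]
    pi0s = [0.1, 0.4, 0.01]
    for kind, param in [("ips", 0.0), ("clip", 0.25), ("ix", 0.1), ("es", 0.5)]:
        print(kind, regularized_ips_risk(costs, pis, pi0s, kind, param))
```

In this toy data the third sample has a logging propensity of 0.01, so its unregularized weight is 90, whereas clipping at 0.25 reduces it to 3.6. This illustrates the trade-off the abstract refers to: regularization shrinks large weights and hence variance, at the price of some bias that a pessimistic penalty then has to account for.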
