Refined PAC-Bayes Bounds for Offline Bandits

Amaury Gouverneur; Tobias J. Oechtering; Mikael Skoglund

TL;DR

This work tackles off-policy evaluation in bandit settings by refining PAC-Bayes bounds for the importance-sampling reward estimator $\widehat{\mathscr{R}}^{\mathrm{IS}}(\pi,H^t)$. Building on prior PAC-Bayes results and a parameter-optimization method based on discretizing the event space, it introduces two parameter-free bounds, one based on the Hoeffding-Azuma inequality and one on Bernstein's inequality, that achieve near-optimal rates. The key technical contribution is an optimized, data-adaptive bound construction that removes the need to pre-specify the trade-off parameter $\lambda$ while preserving uniform validity over policies. These refined bounds tighten the guarantees available for offline policy evaluation and lay the groundwork for PAC-Bayes regret analyses in bandit problems, with potential impact on offline policy selection and safe exploration in sequential decision tasks.

Abstract

In this paper, we present refined probabilistic bounds on empirical reward estimates for off-policy learning in bandit problems. We build on the PAC-Bayesian bounds from Seldin et al. (2010) and improve on their results using a new parameter optimization approach introduced by Rodríguez et al. (2024). This technique is based on a discretization of the space of possible events to optimize the "in probability" parameter. We provide two parameter-free PAC-Bayes bounds, one based on Hoeffding-Azuma's inequality and the other based on Bernstein's inequality. We prove that our bounds are almost optimal as they recover the same rate as would be obtained by setting the "in probability" parameter after the realization of the data.
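
To make the object being bounded concrete, the following is a minimal sketch of an importance-sampling reward estimate computed from logged bandit data. It assumes the standard form, in which each logged reward is reweighted by the ratio of the target policy's probability to the behavior policy's probability for the action actually played, and the results are averaged over rounds. The paper's precise definition of the estimator is not reproduced in this section, so the function name, argument names, and toy data below are illustrative assumptions rather than the authors' notation.

```python
from typing import Sequence


def is_reward_estimate(
    target_policy: Sequence[float],
    behavior_probs: Sequence[float],
    actions: Sequence[int],
    rewards: Sequence[float],
) -> float:
    """Importance-sampling estimate of a target policy's expected reward.

    Assumed (standard) form, not taken verbatim from the paper:
        (1 / t) * sum_s  target_policy[a_s] / behavior_probs[s] * r_s
    where behavior_probs[s] is the probability with which the logging
    policy played action a_s in round s.
    """
    if not (len(behavior_probs) == len(actions) == len(rewards)):
        raise ValueError("logged sequences must have equal length")
    if len(actions) == 0:
        raise ValueError("at least one logged round is required")

    total = 0.0
    for p_behavior, action, reward in zip(behavior_probs, actions, rewards):
        if p_behavior <= 0.0:
            raise ValueError("behavior probabilities must be positive")
        # Reweight the observed reward by the importance ratio.
        total += target_policy[action] / p_behavior * reward
    return total / len(actions)


# Hypothetical toy log: two arms, uniform logging policy, four rounds.
target = [0.2, 0.8]
actions = [0, 1, 1, 0]
rewards = [1.0, 0.0, 1.0, 1.0]
behavior_probs = [0.5, 0.5, 0.5, 0.5]

estimate = is_reward_estimate(target, behavior_probs, actions, rewards)
print(estimate)  # approximately 0.6
```

The importance ratios can be large when the behavior policy rarely plays an action that the target policy favors, which is why concentration bounds for this estimator depend on how small the logging probabilities can get.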
