PAC-Bayesian Reinforcement Learning Trains Generalizable Policies

Abdelkrim Zitouni, Mehdi Hennequin, Juba Agoun, Ryan Horache, Nadia Kabachi, Omar Rivasplata

TL;DR

This work introduces a PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for temporal dependencies via the mixing time of the environment. It yields non-vacuous, data-dependent certificates and motivates PB-SAC, a practical actor-critic algorithm that optimizes the bound to guide exploration and learning in modern off-policy RL. The key contributions are a bound with explicit $\tau_{\rm min}$-dependence and improved scaling, the PB-SAC algorithm with posterior-guided exploration and a stable alternating optimization scheme, and empirical results showing informative certificates alongside competitive performance across continuous-control tasks. This framework bridges learning-theoretic guarantees and practical deep RL, enabling certified performance in sequential decision-making with potential impact on safety-critical applications.

Abstract

We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data, through the chain's mixing time. This contributes to overcoming challenges in obtaining generalization guarantees for reinforcement learning, where the sequential nature of data breaks the independence assumptions underlying classical bounds. Our bound provides non-vacuous certificates for modern off-policy algorithms like Soft Actor-Critic. We demonstrate the bound's practical utility through PB-SAC, a novel algorithm that optimizes the bound during training to guide exploration. Experiments across continuous control tasks show that our approach provides meaningful confidence certificates while maintaining competitive performance.
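
To make the idea of a certificate concrete, the following is a minimal sketch of how a PAC-Bayes-style lower bound on return could be computed. It is not the paper's theorem: it assumes a classical McAllester-style bound in which the sample size is replaced by a heuristic effective sample size `n / tau_mix`, and it assumes diagonal Gaussian prior and posterior distributions over policy parameters. The function names `kl_diag_gaussians` and `certified_return` and the parameters `tau_mix` and `value_range` are hypothetical, chosen only for illustration.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for diagonal Gaussians given per-dimension means and std devs."""
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total


def certified_return(empirical_return, kl, n, tau_mix, delta, value_range):
    """Illustrative lower bound on expected return (assumed form, not the paper's).

    Uses a McAllester-style slack term with the heuristic effective sample
    size n / tau_mix to mimic the cost of temporally dependent data.
    """
    n_eff = n / tau_mix
    complexity = kl + math.log(2 * math.sqrt(n_eff) / delta)
    slack = value_range * math.sqrt(complexity / (2 * n_eff))
    return empirical_return - slack


# Posterior shifted by one prior standard deviation in a single dimension.
kl = kl_diag_gaussians([0.0], [1.0], [1.0], [1.0])

# Slower mixing (larger tau_mix) leaves fewer effectively independent samples,
# so the certificate is looser.
fast = certified_return(0.8, kl, n=10_000, tau_mix=1, delta=0.05, value_range=1.0)
slow = certified_return(0.8, kl, n=10_000, tau_mix=10, delta=0.05, value_range=1.0)
```

In this sketch the certificate always sits below the empirical return, and the gap widens as the KL divergence or the assumed mixing time grows, which is the qualitative behaviour the abstract describes.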

Paper Structure

This paper contains 46 sections, 6 theorems, 46 equations, 4 figures, 1 table, 1 algorithm.

Key Result

Lemma 3.1

Let $\mathfrak D$ be a set of trajectories and $\theta\in\Theta$ be fixed policy parameters. Suppose we form $\bar{\mathfrak D}$ by changing one transition, say the transition at time step $h\in[H]$ of trajectory $j\in[T]$, where $\xi^{(j)}_{h}=(s,a,r,s')$ is replaced with $\bar{\xi}^{(j)}_{h}=(\bar

Figures (4)

  • Figure 1: (a) Performance comparison between our PB-SAC, its baseline SAC, and PBAC (Tasdighi et al., 2025); (b) PAC-Bayes analysis of PB-SAC across environments. The empirical discounted return (dashed line) corresponds to $\mathbb{E}_{\theta \sim \rho}[-\hat{\mathcal{L}}_{\mathfrak{D}}(\theta)]$, and the certified discounted return (solid line) corresponds to the lower bound on $\mathbb{E}_{\theta \sim \rho}[-\mathcal{L}(\theta)]$ provided by Theorem \ref{Th:PBRL} (after rearranging the terms).
  • Figure 2: A basic four-state MDP
  • Figure 3: Illustration of our algorithm PB-SAC
  • Figure 4: (a) Performance comparison between our PB-SAC, its baseline SAC, and PBAC (Tasdighi et al., 2025); (b) PAC-Bayes analysis of PB-SAC across environments. The empirical discounted return (dashed line) corresponds to $\mathbb{E}_{\theta \sim \rho}[-\hat{\mathcal{L}}_{\mathfrak{D}}(\theta)]$, and the certified discounted return (solid line) corresponds to the lower bound on $\mathbb{E}_{\theta \sim \rho}[-\mathcal{L}(\theta)]$ provided by Theorem \ref{Th:PBRL} (after rearranging the terms).
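
The rearrangement mentioned in the captions can be sketched as follows. Assume, for illustration only, that the theorem takes the usual PAC-Bayes shape of an upper bound on the expected loss, with a complexity term written here as $\varepsilon$ (a placeholder symbol; its exact form is not reproduced on this page). With probability at least $1-\delta$:

$$
\mathbb{E}_{\theta \sim \rho}[\mathcal{L}(\theta)] \;\le\; \mathbb{E}_{\theta \sim \rho}[\hat{\mathcal{L}}_{\mathfrak{D}}(\theta)] + \varepsilon
\quad\Longleftrightarrow\quad
\mathbb{E}_{\theta \sim \rho}[-\mathcal{L}(\theta)] \;\ge\; \mathbb{E}_{\theta \sim \rho}[-\hat{\mathcal{L}}_{\mathfrak{D}}(\theta)] - \varepsilon .
$$

Under that assumption, the dashed line is the first term on the right-hand side of the second inequality, and the solid line is the whole right-hand side, so the vertical gap between the two curves equals $\varepsilon$.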

Theorems & Definitions (7)

  • Lemma 3.1: Bounded differences
  • Theorem 3.2
  • Lemma A.1: Markov’s Inequality
  • Lemma A.2: Change of measure
  • Lemma B.1: MGF bound for Markov chains
  • Theorem C.1: Policy-Level REINFORCE
  • proof