PAC-Bayesian Reinforcement Learning Trains Generalizable Policies
Abdelkrim Zitouni, Mehdi Hennequin, Juba Agoun, Ryan Horache, Nadia Kabachi, Omar Rivasplata
TL;DR
This work introduces a PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for temporal dependencies via the mixing time of the environment. It yields non-vacuous, data-dependent certificates and motivates PB-SAC, a practical actor-critic algorithm that optimizes the bound to guide exploration and learning in modern off-policy RL. The key contributions are a bound with explicit $\tau_{\rm min}$-dependence and improved scaling, the PB-SAC algorithm with posterior-guided exploration and a stable alternating optimization scheme, and empirical results showing informative certificates alongside competitive performance across continuous-control tasks. This framework bridges learning-theoretic guarantees and practical deep RL, enabling certified performance in sequential decision-making with potential impact on safety-critical applications.
Abstract
We derive a novel PAC-Bayesian generalization bound for reinforcement learning that explicitly accounts for Markov dependencies in the data, through the chain's mixing time. This contributes to overcoming challenges in obtaining generalization guarantees for reinforcement learning, where the sequential nature of data breaks the independence assumptions underlying classical bounds. Our bound provides non-vacuous certificates for modern off-policy algorithms like Soft Actor-Critic. We demonstrate the bound's practical utility through PB-SAC, a novel algorithm that optimizes the bound during training to guide exploration. Experiments across continuous control tasks show that our approach provides meaningful confidence certificates while maintaining competitive performance.
