PAC-Bayesian Soft Actor-Critic Learning
Bahareh Tasdighi, Abdullah Akgül, Manuel Haussmann, Kenny Kazimirzak Brink, Melih Kandemir
TL;DR
This work addresses the instability and sample inefficiency that critic approximation error causes in actor-critic RL. It introduces PAC4SAC, which trains a single randomized critic using a Probably Approximately Correct (PAC) Bayesian bound as the learning objective. It further improves online exploration through critic-guided multiple-shot action search: the actor samples several candidate actions, evaluates the imagined future of each with the critic, and executes the best one, which speeds up policy improvement. The three-term critic loss combines a Bellman-consistency term, a conservative value update, and an exploration term, yielding better generalization, faster learning, and lower cumulative regret across four continuous-control tasks. Overall, PAC4SAC demonstrates that PAC-Bayesian principles can be integrated into deep RL to achieve robust performance at competitive compute cost, suggesting model-based extensions and more sophisticated bounds as directions for future work.
Abstract
Actor-critic algorithms address the dual goals of reinforcement learning (RL), policy evaluation and improvement, via two separate function approximators. The practicality of this approach comes at the expense of training instability, caused mainly by the destructive effect of the critic's approximation errors on the actor. We tackle this bottleneck by employing, for the first time, an existing Probably Approximately Correct (PAC) Bayesian bound as the critic training objective of the Soft Actor-Critic (SAC) algorithm. We further demonstrate that online learning performance improves significantly when a stochastic actor explores multiple futures by critic-guided random search. The resulting algorithm compares favorably with the state-of-the-art SAC implementation on multiple classical control and locomotion tasks in terms of both sample efficiency and regret.
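The critic-guided random search mentioned above can be illustrated with a minimal sketch: draw several candidate actions from a stochastic policy, score each with the critic, and keep the highest-scoring one. This is not the authors' implementation. The function `multi_shot_action`, the parameter `num_shots`, and the stand-ins `toy_policy` and `toy_critic` are hypothetical names introduced only for this illustration, and the toy critic is deterministic, whereas the paper's critic is randomized.

```python
import numpy as np


def multi_shot_action(state, sample_action, critic, num_shots=8, rng=None):
    """Sample `num_shots` candidate actions and return the one the critic rates highest.

    Hypothetical helper for illustration; returns (best_action, best_score).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Draw candidate actions from the stochastic policy.
    candidates = [sample_action(state, rng) for _ in range(num_shots)]
    # Score every candidate with the critic's value estimate.
    scores = np.array([critic(state, a) for a in candidates])
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])


def toy_policy(state, rng):
    # Assumed stand-in policy: Gaussian noise around the state, squashed to [-1, 1].
    return np.tanh(state + 0.5 * rng.standard_normal(state.shape))


def toy_critic(state, action):
    # Assumed stand-in value estimate: prefers actions close to -state.
    return -float(np.sum((action + state) ** 2))


state = np.array([0.3, -0.2])
action, value = multi_shot_action(
    state, toy_policy, toy_critic, num_shots=16, rng=np.random.default_rng(0)
)
print("chosen action:", action, "critic score:", value)
```

With `num_shots=1` the sketch reduces to ordinary sampling from the policy; larger values trade extra critic evaluations for a more selective choice of the executed action.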
