PAC-Bayesian Soft Actor-Critic Learning

Bahareh Tasdighi, Abdullah Akgül, Manuel Haussmann, Kenny Kazimirzak Brink, Melih Kandemir

TL;DR

This work addresses the instability and sample inefficiency that critic approximation error causes in actor-critic RL. It introduces PAC4SAC, which trains a single randomized critic using a Probably Approximately Correct (PAC) Bayesian bound as the learning objective. Online exploration is further improved by critic-guided multiple-shot action search: the actor evaluates several imagined futures under the critic and picks the best, which makes each policy improvement step more effective. The three-term critic loss combines Bellman consistency, a conservative value update, and an exploration term, yielding better generalization and faster learning across four continuous-control tasks with lower cumulative regret. Overall, PAC4SAC demonstrates that PAC-Bayesian principles can be integrated into deep RL to achieve robust performance at a competitive computational cost, and it suggests model-based extensions and tighter bounds as directions for future work.
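
The multiple-shot action search described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the callable interfaces, the toy Gaussian actor, the toy quadratic critic, and the choice to draw one critic per decision are all assumptions made for illustration.

```python
import random


def multi_shot_action(state, sample_action, sample_critic, num_shots, rng):
    """Return the candidate action that a sampled critic scores highest.

    sample_action(state, rng) -> action    (stochastic actor, hypothetical interface)
    sample_critic(rng) -> q(state, action) (one draw of the randomized critic)
    """
    q = sample_critic(rng)  # assumption: one critic draw per decision
    candidates = [sample_action(state, rng) for _ in range(num_shots)]
    scores = [q(state, a) for a in candidates]
    best = max(range(num_shots), key=scores.__getitem__)
    return candidates[best], candidates


# Toy 1-D stand-ins (not from the paper): a Gaussian actor and a quadratic
# critic whose preferred action is perturbed on every draw.
def toy_actor(state, rng):
    return rng.gauss(0.0, 1.0)


def toy_critic(rng):
    preferred = 0.5 + rng.gauss(0.0, 0.05)
    return lambda state, action: -(action - preferred) ** 2


rng = random.Random(0)
action, candidates = multi_shot_action(0.0, toy_actor, toy_critic, num_shots=8, rng=rng)
print(action, len(candidates))
```

With `num_shots=1` the sketch reduces to ordinary sampling from the stochastic actor, so the number of shots is the only knob that separates the two behaviours here.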

Abstract

Actor-critic algorithms address the dual goals of reinforcement learning (RL), policy evaluation and improvement, via two separate function approximators. The practicality of this approach comes at the expense of training instability, caused mainly by the destructive effect of the critic's approximation errors on the actor. We tackle this bottleneck by employing an existing Probably Approximately Correct (PAC) Bayesian bound for the first time as the critic training objective of the Soft Actor-Critic (SAC) algorithm. We further demonstrate that online learning performance improves significantly when a stochastic actor explores multiple futures by critic-guided random search. The resulting algorithm compares favorably with the state-of-the-art SAC implementation on multiple classical control and locomotion tasks in terms of both sample efficiency and regret.

Paper Structure

This paper contains 27 sections, 2 theorems, 23 equations, 2 figures, 3 tables.

Key Result

Theorem 1

For any posterior measure $\mu$ and any prior measure $\mu_0$ defined on the space of action-value functions $Q$ and any data set $N$ containing $(s,a,r,s')$ collected from a single execution of a fixed policy $\pi$, the following inequality holds with probability greater than $1- \delta$: where $||\Gamma_\pi^N||$ is the operator norm of the matrix in its argument.

Figures (2)

  • Figure 1: Our Probably Approximately Correct Bayes for Soft Actor-Critic (PAC4SAC) algorithm is the first to train a critic with random parameters $\theta$ using a PAC-Bayesian bound as its training objective. The random critic enables effective random optimal action search when used as a guide for a stochastic policy. The resulting algorithm solves online reinforcement learning tasks with fewer environment interactions and smaller cumulative regret than its counterparts.
  • Figure 2: The effect of critic-guided random optimal action search (multiple shooting) on the performance of our PAC4SAC algorithm demonstrated on the cartpole swingup environment. Taking more samples reduces cumulative regret (left panel) and improves sample efficiency (right panel).
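
Cumulative regret, the quantity compared in Figure 2, can be sketched as follows. The definition used here is an assumption, not taken from the paper: the regret of an episode is the gap between a reference return and the achieved return, accumulated over episodes. The function name and the numbers are hypothetical.

```python
def cumulative_regret(episode_returns, reference_return):
    """Running sum of (reference_return - achieved return) over episodes.

    Assumed definition for illustration; the paper may normalize or
    define the reference differently.
    """
    total = 0.0
    curve = []
    for achieved in episode_returns:
        total += reference_return - achieved
        curve.append(total)
    return curve


# Hypothetical returns of a learner that improves over three episodes.
curve = cumulative_regret([0.0, 50.0, 100.0], reference_return=100.0)
print(curve)  # [100.0, 150.0, 150.0]
```

Under this definition the curve flattens once the learner reaches the reference return, so an algorithm that learns faster ends with a lower final value.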

Theorems & Definitions (2)

  • Theorem 1
  • Theorem 2