S$^2$AC: Energy-Based Reinforcement Learning with Stein Soft Actor Critic

Safa Messaoud; Billel Mokeddem; Zhenghai Xue; Linsey Pang; Bo An; Haipeng Chen; Sanjay Chawla

TL;DR

This paper introduces S$^2$AC, an actor-critic MaxEnt RL algorithm that uses a parameterized SVGD sampler to realize expressive, multimodal policies represented as EBMs over Q-values. A key contribution is a closed-form entropy estimate for the SVGD-induced policy, derived from the invertibility of the SVGD update and the change-of-variables formula, which enables principled entropy optimization without costly sampling. The method, which includes a parameterized initial distribution to accelerate convergence, yields improved multimodal behavior and robustness in multi-goal tasks and achieves competitive or superior performance on MuJoCo benchmarks; an amortized test-time variant reduces inference cost. Overall, S$^2$AC provides a scalable, expressive alternative to SAC and SQL for MaxEnt RL, with demonstrated benefits in exploration, stability, and policy robustness across domains.

Abstract

Learning expressive stochastic policies instead of deterministic ones has been proposed to achieve better stability, sample complexity, and robustness. Notably, in Maximum Entropy Reinforcement Learning (MaxEnt RL), the policy is modeled as an expressive Energy-Based Model (EBM) over the Q-values. However, this formulation requires the estimation of the entropy of such EBMs, which is an open problem. To address this, previous MaxEnt RL methods either implicitly estimate the entropy, resulting in high computational complexity and variance (SQL), or follow a variational inference procedure that fits simplified actor distributions (e.g., Gaussian) for tractability (SAC). We propose Stein Soft Actor-Critic (S$^2$AC), a MaxEnt RL algorithm that learns expressive policies without compromising efficiency. Specifically, S$^2$AC uses parameterized Stein Variational Gradient Descent (SVGD) as the underlying policy. We derive a closed-form expression of the entropy of such policies. Our formula is computationally efficient and only depends on first-order derivatives and vector products. Empirical results show that S$^2$AC yields more optimal solutions to the MaxEnt objective than SQL and SAC in the multi-goal environment, and outperforms SAC and SQL on the MuJoCo benchmark. Our code is available at: https://github.com/SafaMessaoud/S2AC-Energy-Based-RL-with-Stein-Soft-Actor-Critic
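To make the idea of an SVGD-based policy concrete, the following is a minimal NumPy sketch of plain (non-parameterized) SVGD sampling from a toy one-dimensional, bimodal energy that stands in for $Q(s,a)/\alpha$ at a fixed state. It is an illustration under stated assumptions, not the authors' implementation: the function names (`rbf_kernel`, `svgd_step`, `score_bimodal`, `sample_svgd`), the fixed kernel bandwidth, the step size, and the toy energy are all hypothetical choices made for this example.

```python
import numpy as np


def rbf_kernel(x, h):
    """RBF kernel matrix K[j, i] = k(x_j, x_i) and its gradient w.r.t. x_j."""
    diff = x[:, None, :] - x[None, :, :]          # diff[j, i] = x_j - x_i
    sq_dist = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq_dist / (2.0 * h ** 2))
    grad_K = -diff / h ** 2 * K[:, :, None]       # d k(x_j, x_i) / d x_j
    return K, grad_K


def svgd_step(x, score, eps, h):
    """One SVGD update x_i <- x_i + eps * phi(x_i) for all particles."""
    n = x.shape[0]
    K, grad_K = rbf_kernel(x, h)
    # Attractive term: kernel-weighted scores. Repulsive term: kernel gradients.
    phi = (K.T @ score(x) + grad_K.sum(axis=0)) / n
    return x + eps * phi


def score_bimodal(a):
    """Gradient of log p for an equal mixture of N(-2, 1) and N(2, 1).

    Assumption: this toy energy plays the role of Q(s, a) / alpha at one state.
    """
    w_pos = np.exp(-0.5 * (a - 2.0) ** 2)
    w_neg = np.exp(-0.5 * (a + 2.0) ** 2)
    return (-(a - 2.0) * w_pos - (a + 2.0) * w_neg) / (w_pos + w_neg)


def sample_svgd(n=50, steps=500, eps=0.1, h=0.5, seed=0):
    """Run SVGD from a standard normal initial distribution (assumed here)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 1))
    for _ in range(steps):
        x = svgd_step(x, score_bimodal, eps, h)
    return x


particles = sample_svgd()
print("fraction of particles near the right mode:", np.mean(particles > 0))
```

The repulsive kernel-gradient term is what keeps the particles from collapsing onto a single mode, which is the property that lets an SVGD-based actor represent multimodal action distributions where a single Gaussian cannot.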

Paper Structure (35 sections, 10 theorems, 42 equations, 17 figures, 4 tables, 2 algorithms)

Introduction
Preliminaries
Samplers for Energy-based Models
Maximum-Entropy RL
Approach
Stein Soft Actor Critic
A Closed-Form Expression of the Policy's Entropy
Invertible Policies
Results
Entropy Evaluation
Multi-goal Experiments
Mujoco Experiments
Related Work
Conclusion
Summary
...and 20 more sections

Key Result

Theorem 3.1

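Let $F:\mathbb{R}^{n} \rightarrow \mathbb{R}^{n}$ be an invertible transformation of the form $F(a) = a + \epsilon h(a)$. We denote by $q^L(a^L)$ the distribution obtained from repeatedly applying $F$ to a set of samples $\{a^{0}\}$ from an initial distribution $q^{0}(a^0)$ over $L$ steps, i.e., $a^L = F \circ F \circ \cdots \circ F(a^0)$. Under the condition $\epsilon \|\nabla_{a^l} h(a^l)\|_{\infty} \ll 1$ for all steps $l$, the closed-form expression of $\log q^L(a^L)$ is:

$$\log q^L(a^L) = \log q^0(a^0) - \epsilon \sum_{l=0}^{L-1} \mathrm{Tr}\left(\nabla_{a^l} h(a^l)\right) + \mathcal{O}(\epsilon^2 dL).$$

Here, $d$ is the dimensionality of $a$, i.e., $a\in \mathbb{R}^d$ and $\mathcal{O}(\epsilon^2 dL)$ is the order of the approximation error.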
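The first-order reasoning behind such a closed form can be checked numerically. For an invertible map $F(a) = a + \epsilon h(a)$, the change-of-variables formula subtracts $\log|\det(I + \epsilon \nabla_a h(a))|$ at each step, and for small $\epsilon$ this log-determinant is approximately $\epsilon\,\mathrm{Tr}(\nabla_a h(a))$. The sketch below compares the exact and the trace-based log-density along a short trajectory. Everything specific in it is an assumption made for illustration: the velocity field `h` (a `tanh` layer with a symmetric matrix `W`), the standard normal initial distribution, and the helper names.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
M = rng.normal(size=(d, d))
W = 0.5 * (M + M.T)  # hypothetical symmetric weight matrix for the toy field


def h(a):
    """Hypothetical smooth velocity field h(a) = tanh(W a)."""
    return np.tanh(W @ a)


def jac_h(a):
    """Jacobian of h at a: diag(1 - tanh(W a)^2) @ W."""
    return (1.0 - np.tanh(W @ a) ** 2)[:, None] * W


def log_q0(a):
    """Log-density of the assumed initial distribution N(0, I)."""
    return -0.5 * a @ a - 0.5 * d * np.log(2.0 * np.pi)


def log_density_after(a0, eps, L):
    """Apply F(a) = a + eps * h(a) L times, tracking log q exactly and to first order."""
    a = a0.copy()
    exact = approx = log_q0(a0)
    for _ in range(L):
        J = jac_h(a)
        _, logdet = np.linalg.slogdet(np.eye(d) + eps * J)
        exact -= logdet                 # exact change of variables
        approx -= eps * np.trace(J)     # first-order (trace) approximation
        a = a + eps * h(a)
    return a, exact, approx


a0 = rng.normal(size=d)
for eps in (0.1, 0.01):
    _, exact, approx = log_density_after(a0, eps, L=5)
    print(f"eps={eps}: |exact - approx| = {abs(exact - approx):.2e}")
```

The trace-based estimate needs only first-order derivatives, and with `L=0` both quantities reduce to the log-density of the initial distribution.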

Figures (17)

Figure 1: Comparing S$^2$AC to SQL and SAC. S$^2$AC with a parameterized policy is reduced to SAC if the number of SVGD steps is 0. SQL becomes equivalent to S$^2$AC if the entropy is evaluated explicitly with our derived formula.
Figure 2: S$^2$AC learns a more optimal solution to the MaxEnt RL objective than SAC and SQL. We design a multigoal environment where an agent starts from the center of the 2-d map and tries to reach one of the three goals ($G_1$, $G_2$, and $G_3$). The maximum expected future reward (level curves) is the same for all the goals but the expected future entropy is different (higher on the path to $G_2/G_3$): the action distribution $\pi(a|s)$ is bi-modal on the path to the left ($G_2$ and $G_3$) and unimodal to the right ($G_1$). Hence, we expect the optimal policy for the MaxEnt RL objective to assign more weights to $G_2$ and $G_3$. We visualize trajectories (in blue) sampled from the policies learned using SAC, SQL, and S$^2$AC. SAC quickly commits to a single mode due to its actor being tied to a Gaussian policy. Though SQL also recovers the three modes, the trajectories are evenly distributed. S$^2$AC recovers all the modes and approaches the left two goals more frequently. This indicates that it successfully maximizes not only the expected future reward but also the expected future entropy.
Figure 3: S$^2$AC($\phi, \theta$) achieves faster convergence to the target distribution (in orange) than S$^2$AC($\phi$) by parameterizing the initial distribution $\mathcal{N}(\mu_{\theta},\sigma_{\theta})$ of the SVGD sampler.
Figure 4: Entropy evaluation results.
Figure 5: Multigoal env.
...and 12 more figures

Theorems & Definitions (15)

Theorem 3.1
Proposition 3.2: SVGD invertibility
Theorem 3.3
Proposition 3.4: SGLD, HMC
proof
proof
Theorem
proof
Theorem F.1: Implicit function theorem
Proposition: SGLD
...and 5 more
