Generalizing soft actor-critic algorithms to discrete action spaces

Le Zhang; Yong Gu; Xin Zhao; Yanshuo Zhang; Shu Zhao; Yifei Jin; Xinxin Wu

TL;DR

This work generalizes soft actor-critic (SAC) to discrete action spaces by formulating a discrete SAC variant with an explicit policy head, enabling off-policy learning in discrete domains and allowing integration with the BBF Rainbow backbone. It provides theoretical convergence guarantees for the discrete setting and introduces variance-reduction and entropy-regularization techniques to stabilize learning. The authors further couple this discrete SAC with BBF to create SAC-BBF, achieving a new state-of-the-art IQM of $1.088$ on Atari 100K with RR $2$, and demonstrating significantly faster training than BBF at higher RR values. The results indicate that introducing explicit policy heads in model-free, sample-efficient RL is viable for discrete actions and can yield super-human performance with modest compute, offering practical implications for Atari benchmarks and broader discrete-action tasks.

Abstract

ATARI is a suite of video games used by reinforcement learning (RL) researchers to test the effectiveness of learning algorithms. Receiving only the raw pixels and the game score, the agent learns to develop sophisticated strategies, even reaching a level comparable to that of a professional human games tester. Ideally, we also want an agent that requires very few interactions with the environment. Previous competitive model-free algorithms for the task use the value-based Rainbow algorithm without any policy head. In this paper, we change this by proposing a practical discrete variant of the soft actor-critic (SAC) algorithm. The new variant enables off-policy learning using policy heads for discrete domains. By incorporating it into the advanced Rainbow variant, i.e., "bigger, better, faster" (BBF), the resulting SAC-BBF improves the previous state-of-the-art interquartile mean (IQM) from 1.045 to 1.088, and it achieves these results using a replay ratio (RR) of only 2. With the lower RR of 2, the training time of SAC-BBF is strictly one-third of the time required for BBF to achieve an IQM of 1.045 using RR 8. As an IQM value greater than one indicates super-human performance, SAC-BBF is also the only model-free algorithm to reach a super-human level using only RR 2. The code is publicly available on GitHub at https://github.com/lezhang-thu/bigger-better-faster-SAC.
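
To make the IQM figure in the abstract concrete, the following is a minimal sketch of an interquartile mean: the mean of the middle 50% of scores after the lowest and highest 25% are discarded. It assumes the scores are human-normalized (so 1.0 corresponds to human level) and pooled across runs and games. The function name `interquartile_mean` and the score values are hypothetical and not taken from the paper or its repository.

```python
import numpy as np

def interquartile_mean(scores):
    """Mean of the middle 50% of the values (bottom and top 25% dropped)."""
    flat = np.sort(np.asarray(scores, dtype=float).ravel())
    cut = int(0.25 * flat.size)  # number of values dropped at each end
    return float(flat[cut:flat.size - cut].mean())

# Hypothetical human-normalized scores (rows: runs, columns: games).
scores = np.array([[0.1, 0.5, 0.9, 1.2],
                   [0.2, 0.6, 1.0, 5.0]])

iqm = interquartile_mean(scores)
plain_mean = float(scores.mean())
```

In this toy data the single outlier score of 5.0 pulls the plain mean above one, while the interquartile mean ignores it. That robustness to a few extreme games is a common reason for reporting IQM alongside or instead of the mean.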

Paper Structure (36 sections, 5 theorems, 17 equations, 2 figures, 5 tables, 1 algorithm)

Introduction
Related work
Competitive representatives in ATARI 100K
Previous results on discrete variants of SAC
Previous algorithms combining Q-learning with actor-critic
Preliminaries
The SAC algorithm
The BBF algorithm
Evaluation metrics
A discrete variant of SAC for standard maximum reward RL
Policy evaluation
Policy improvement
A practical algorithm
Variance reduction
An entropy bonus
...and 21 more sections

Key Result

Lemma 4.1

Let $\mathcal{T}^\pi$ be the Bellman backup operator for the policy $\pi$, and let $Q^0:\mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}$ be an arbitrary mapping. Define $Q^{k+1}=\mathcal{T}^\pi Q^k$. Then, as $k$ approaches infinity, the sequence $Q^k$ converges to the Q-value of $\pi$.
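
The convergence stated in the lemma can be checked numerically on a toy problem. The sketch below assumes the standard maximum-reward backup without an entropy term, $(\mathcal{T}^\pi Q)(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a) \sum_{a'} \pi(a'|s') Q(s',a')$, which may differ in detail from the paper's own definition. The two-state MDP, the policy, and all variable names are hypothetical.

```python
import numpy as np

gamma = 0.9  # discount factor (assumed value)

# Hypothetical MDP with 2 states and 2 actions.
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s']: transition probabilities
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],                 # R[s, a]: expected reward
              [0.0, 2.0]])
pi = np.array([[0.7, 0.3],                # pi[s, a]: fixed policy being evaluated
               [0.4, 0.6]])

def bellman_backup(Q):
    """Apply the assumed backup operator T^pi to a Q-table of shape (S, A)."""
    V = (pi * Q).sum(axis=1)              # V[s'] = sum_a' pi(a'|s') Q(s', a')
    return R + gamma * (P @ V)            # expectation over next states s'

# Iterate Q^{k+1} = T^pi Q^k starting from Q^0 = 0.
Q = np.zeros_like(R)
for _ in range(500):
    Q = bellman_backup(Q)

# Closed-form Q-value of pi: solve (I - gamma * P_pi) q = r over (s, a) pairs.
n_s, n_a = R.shape
P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(n_s * n_a, n_s * n_a)
Q_exact = np.linalg.solve(np.eye(n_s * n_a) - gamma * P_pi,
                          R.ravel()).reshape(n_s, n_a)
```

Because the backup is a $\gamma$-contraction in the max norm, the iterates approach `Q_exact` geometrically, and the starting table `Q` can be replaced by any other array of the same shape without changing the limit.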

Figures (2)

Figure 1: Architecture of SAC-BBF. Modules within dashed boxes represent additions introduced by SAC-BBF. In this architecture, the target modules typically correspond to exponential moving average (EMA) versions of their online counterparts. The encoders used are Impala-CNN (Espeholt et al., 2018), with each layer's width increased by a factor of four. Regarding the input of actions into the "conv. transition model," each action is encoded as a one-hot vector and then broadcast to every location of the convolutional output from the encoder. The remaining modules in the architecture consist of linear layers.
Figure 2: Aggregate metrics with 95% stratified bootstrap CIs for representatives of RL algorithms within the Atari 100K benchmark. The results from SimPLe to SPR represent the default metrics provided in . The data for BBF-RR2 and BBF-RR8 are from the official repository of Schwarzer et al. (2023). The scripts in truncate only the first ten runs from all independent runs for each game. The statistics may thus differ from those presented in Table \ref{tab:Atari_100K} for BBF-RR2 and BBF-RR8.
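
The action encoding described in the Figure 1 caption can be sketched as follows. This is an illustration only: it assumes a channels-last feature map and assumes the broadcast one-hot planes are concatenated to the encoder output along the channel axis, which the caption does not state explicitly. The helper name `append_action` and the shapes are hypothetical.

```python
import numpy as np

def append_action(features, action, num_actions):
    """Tile a one-hot action over every spatial location of a feature map.

    features: array of shape (H, W, C), the encoder's convolutional output.
    Returns an array of shape (H, W, C + num_actions).
    """
    h, w, _ = features.shape
    one_hot = np.eye(num_actions, dtype=features.dtype)[action]
    tiled = np.broadcast_to(one_hot, (h, w, num_actions))
    # Assumption: the action planes are joined to the features by concatenation.
    return np.concatenate([features, tiled], axis=-1)

# Hypothetical encoder output and action index.
features = np.zeros((11, 11, 8), dtype=np.float32)
out = append_action(features, action=3, num_actions=18)
```

Every spatial position then carries the same action indicator, so a convolutional transition model can condition on the action at each location.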

Theorems & Definitions (14)

Lemma 4.1: Policy Evaluation
proof
Lemma 4.2: Policy Improvement
proof
Theorem 4.3: Policy Iteration
proof
Lemma 4.4
proof
Lemma 4.5: Variance reduction
proof
...and 4 more
