Revisiting Discrete Soft Actor-Critic

Haibin Zhou; Tong Wei; Zichuan Lin; junyou li; Junliang Xing; Yuanchun Shi; Li Shen; Chao Yu; Deheng Ye

TL;DR

This work revisits vanilla discrete SAC, analyzes the Q-value underestimation and performance instability it exhibits in discrete action spaces, and proposes Stable Discrete SAC (SDSAC), an algorithm that uses an entropy penalty and double average Q-learning with Q-clip to address these issues.

Abstract

We study the adaptation of Soft Actor-Critic (SAC), widely considered a state-of-the-art reinforcement learning (RL) algorithm, from continuous to discrete action spaces. We revisit vanilla discrete SAC and provide an in-depth analysis of its Q-value underestimation and performance instability in discrete settings. We then propose Stable Discrete SAC (SDSAC), an algorithm that uses an entropy penalty and double average Q-learning with Q-clip to address these issues. Extensive experiments on typical benchmarks with discrete action spaces, including Atari games and a large-scale MOBA game, show the efficacy of the proposed method. Our code is at: https://github.com/coldsummerday/SD-SAC.git.
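
To make the underestimation issue concrete, the sketch below compares two ways of combining a pair of critics when computing the soft state value of a discrete policy: taking the minimum (clipped double Q-learning, as in standard SAC) and taking the mean. Reading "double average Q-learning" as a plain mean of two critics is an assumption based on the name alone, and the function names, arguments, and numbers are hypothetical. The entropy penalty and Q-clip are not shown.

```python
import math


def soft_state_value(probs, q1, q2, alpha, aggregate="mean"):
    """Soft value of one state under a discrete policy.

    probs: action probabilities pi(a|s).
    q1, q2: per-action Q estimates from two critics.
    alpha: entropy temperature.
    aggregate: "min" (clipped double Q) or "mean" (assumed averaging variant).
    """
    if aggregate == "min":
        q = [min(a, b) for a, b in zip(q1, q2)]
    elif aggregate == "mean":
        q = [0.5 * (a + b) for a, b in zip(q1, q2)]
    else:
        raise ValueError("aggregate must be 'min' or 'mean'")
    # With discrete actions the expectation over actions is an exact sum.
    return sum(
        p * (qa - alpha * math.log(p)) for p, qa in zip(probs, q) if p > 0.0
    )


def td_target(reward, done, gamma, next_value):
    """One-step bootstrapped target; `done` is 1.0 at episode end, else 0.0."""
    return reward + gamma * (1.0 - done) * next_value


# Illustrative numbers only: the two critics disagree on actions 0 and 1.
probs = [0.5, 0.25, 0.25]
q1 = [1.0, 2.0, 0.0]
q2 = [2.0, 1.0, 0.0]

v_min = soft_state_value(probs, q1, q2, alpha=0.0, aggregate="min")
v_mean = soft_state_value(probs, q1, q2, alpha=0.0, aggregate="mean")
```

Whenever the critics disagree, the minimum is never larger than the mean, so targets built from it are systematically lower. This is one plausible source of the underestimation the abstract refers to.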

Paper Structure (40 sections, 17 equations, 25 figures, 4 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Failure Modes of Vanilla Discrete SAC
Unstable Coupling Training
Pessimistic Exploration
Improvements of SAC Failure Modes
Entropy-Penalty
Double Average Q-learning with Q-clip
Pseudocode
Experiments
Experimental Setup
Overall Performance
Ablation Study
Qualitative Analysis
...and 25 more sections

Figures (25)

Figure 1: Gameplay screenshot of the Atari game Asterix, showing the player-controlled Asterix (yellow box), scoring objects (green box), and life-losing lyres (orange box) that appear in rounds. Deceptive rewards appear in the early stage of the game, when there are only scoring objects.
Figure 2: Measuring Q variance, estimation of Q-value, policy entropy, episode length, steps with rewards, and score on Atari Game Asterix with discrete SAC over 10 million timesteps.
Figure 3: Results on the Atari games Frostbite/MsPacman over 2/5 million time steps: a) Q-value estimates of discrete SAC; b) Q-value estimates of discrete SAC with single Q; c) Score comparison between discrete SAC and discrete SAC with single Q.
Figure 4: Q-function variance, policy action entropy, Q-value estimates, and score on the Atari game Asterix for discrete SAC, discrete SAC with KL-penalty, and discrete SAC with entropy-penalty over 10 million time steps.
Figure 5: Q-value estimates and score on the Atari games Frostbite/MsPacman for discrete SAC, discrete SAC with REDQ, discrete SAC with REM, and ours (SD-SAC) over 10 million time steps.
...and 20 more figures
