
Rethinking Soft Actor-Critic in High-Dimensional Action Spaces: The Cost of Ignoring Distribution Shift

Yanjun Chen, Xinming Zhang, Xianghui Wang, Zhiqiang Xu, Xiaoyu Shen, Wei Zhang

TL;DR

This work addresses the distribution shift introduced by the tanh action squashing used in Soft Actor-Critic (SAC), which distorts the Gaussian action distribution and biases action selection in high-dimensional continuous control. It uses a change-of-variables argument to derive the exact PDF of the squashed action $y=\tanh(u)$, namely $p(y)=p(u)\left|\frac{du}{dy}\right|$ with $\left|\frac{du}{dy}\right|=\frac{1}{1-y^2}$, and shows that the mode of the transformed distribution does not coincide with $\tanh(\mu)$, a mismatch that compounds as dimensionality grows. The authors validate these insights on HumanoidBench, comparing Standard SAC to a Corrected SAC that selects actions using the mode of the transformed distribution, and report improvements in cumulative rewards, reliability (IQM/Median), performance profiles, and sample efficiency. The results suggest that addressing transformation-induced biases is essential for robust high-dimensional continuous control, and that the same concern may apply to other nonlinear bounded-action schemes beyond SAC.
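The mismatch between $\tanh(\mu)$ and the mode of the squashed density can be checked numerically in one dimension. The sketch below is an illustration only, not the authors' implementation: the function names `tanh_gaussian_pdf` and `grid_mode`, the grid search, and the values $\mu=1.0$, $\sigma=0.5$ are all assumptions chosen for the example.

```python
import numpy as np

def tanh_gaussian_pdf(y, mu, sigma):
    """Density of y = tanh(u) with u ~ N(mu, sigma^2), by change of variables."""
    u = np.arctanh(y)                      # inverse transform u = arctanh(y)
    gauss = np.exp(-0.5 * ((u - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return gauss / (1.0 - y ** 2)          # multiply by |du/dy| = 1 / (1 - y^2)

def grid_mode(mu, sigma, n=200_001):
    """Approximate the mode of the squashed density by a dense grid search."""
    y = np.linspace(-1.0, 1.0, n)[1:-1]    # drop the endpoints, where arctanh diverges
    return y[np.argmax(tanh_gaussian_pdf(y, mu, sigma))]

# Illustrative (assumed) parameters for a single action dimension.
mu, sigma = 1.0, 0.5
naive = np.tanh(mu)            # squashed Gaussian mean, the usual deterministic action
mode = grid_mode(mu, sigma)    # highest-density point of the squashed distribution

print(f"tanh(mu) = {naive:.3f}, mode of p(y) = {mode:.3f}")
```

With these illustrative values, $\tanh(\mu)\approx 0.762$ while the grid search places the mode near $0.895$, so the highest-density action sits closer to the boundary than the squashed mean. A grid search is used here only for transparency; it would not scale to a real policy.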

Abstract

The Soft Actor-Critic (SAC) algorithm is widely recognized for its robust performance across a range of deep reinforcement learning tasks, where it uses the tanh transformation to constrain actions within bounded limits. However, this transformation induces a distribution shift, distorting the original Gaussian action distribution and potentially leading the policy to select suboptimal actions, particularly in high-dimensional action spaces. In this paper, we conduct a comprehensive theoretical and empirical analysis of this distribution shift, deriving the precise probability density function (PDF) for actions following the tanh transformation to clarify the misalignment between the transformed distribution's mode and the intended action output. We substantiate these theoretical insights through extensive experiments on high-dimensional tasks within the HumanoidBench benchmark. Our findings indicate that accounting for this distribution shift substantially enhances SAC's performance, resulting in notable improvements in cumulative rewards, sample efficiency, and reliability across tasks. These results underscore a critical consideration for SAC and similar algorithms: addressing transformation-induced distribution shifts is essential to optimizing policy effectiveness in high-dimensional deep reinforcement learning environments, thereby expanding the robustness and applicability of SAC in complex control tasks.
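The mode misalignment described above can be made concrete with a short worked derivation for a single action dimension. This is a sketch under the assumption of a scalar pre-squash variable $u\sim\mathcal{N}(\mu,\sigma^2)$; the notation $y^\ast$ for a stationary point is introduced here for illustration and may differ from the paper's own presentation.

```latex
\begin{align}
y &= \tanh(u), \qquad u \sim \mathcal{N}(\mu, \sigma^2), \qquad u = \operatorname{arctanh}(y), \\
p(y) &= p(u)\left|\frac{du}{dy}\right|
      = \frac{1}{\sigma\sqrt{2\pi}}
        \exp\!\left(-\frac{\bigl(\operatorname{arctanh}(y)-\mu\bigr)^2}{2\sigma^2}\right)
        \frac{1}{1-y^2}, \\
\frac{d}{dy}\log p(y) &= \frac{1}{1-y^2}
        \left(2y - \frac{\operatorname{arctanh}(y)-\mu}{\sigma^2}\right), \\
\frac{d}{dy}\log p(y^\ast) = 0 \;&\iff\; \operatorname{arctanh}(y^\ast) = \mu + 2\sigma^2\, y^\ast .
\end{align}
```

Substituting $y^\ast=\tanh(\mu)$ into the last line gives $\mu=\mu+2\sigma^2\tanh(\mu)$, which holds only when $\mu=0$ or $\sigma=0$. For any other parameters the stationary point, and hence the mode when it is unique, is displaced from $\tanh(\mu)$ toward the nearer boundary of $(-1,1)$.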

Paper Structure

This paper contains 21 sections, 6 equations, 9 figures, 2 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison of probability densities between the original Gaussian distribution and the transformed distribution after tanh transformation.
  • Figure 2: Impact of the tanh transformation on Gaussian distributions with different means and a fixed standard deviation (0.5). The figure highlights the variation in probability density introduced by the transformation.
  • Figure 3: Comparison of action selection with and without accounting for distribution shift. The orange point represents the transformed location of the original mode, while the red point indicates the mode of the tanh-transformed distribution.
  • Figure 4: Illustration of the distribution shift in a 2D action space induced by the tanh transformation. Each dimension contributes independently to the overall misalignment, amplifying the effect in high-dimensional action spaces.
  • Figure 5: Overview of the tasks from the HumanoidBench benchmark: (a) cube, (b) powerlift, and (c) bookshelf.
  • ...and 4 more figures