Rethinking Soft Actor-Critic in High-Dimensional Action Spaces: The Cost of Ignoring Distribution Shift
Yanjun Chen, Xinming Zhang, Xianghui Wang, Zhiqiang Xu, Xiaoyu Shen, Wei Zhang
TL;DR
This work addresses the distribution shift introduced by the tanh action squashing used in Soft Actor-Critic (SAC), which distorts the Gaussian action distribution and biases action selection in high-dimensional continuous control. It develops a formal change-of-variables framework to derive the exact transformed action PDF $p(y)=p(u)\left|\frac{du}{dy}\right|$, where $y=\tanh(u)$ is the squashed action, $u$ is the pre-squash Gaussian sample, and $\left|\frac{du}{dy}\right|=\frac{1}{1-y^2}$ in each action dimension. It then demonstrates that the mode of the transformed distribution does not align with $\tanh(\mu)$, especially as dimensionality grows. The authors validate these insights on HumanoidBench, comparing Standard SAC with a Corrected SAC that selects actions according to the mode of the transformed distribution, and report improvements in cumulative rewards, reliability (IQM and median), performance profiles, and sample efficiency. The results imply that addressing transformation-induced biases is essential for robust, high-dimensional continuous control, and that the analysis may generalize to other nonlinear bounded-action schemes beyond SAC.
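The mismatch between $\tanh(\mu)$ and the mode of the squashed density can be checked numerically in one dimension. The sketch below is an illustration under stated assumptions, not the authors' Corrected SAC procedure: the helper names `tanh_gaussian_pdf` and `grid_mode`, the values $\mu=0.5$ and $\sigma=0.5$, and the brute-force grid search are all hypothetical choices made here for clarity.

```python
import numpy as np


def tanh_gaussian_pdf(y, mu, sigma):
    """Density of y = tanh(u) with u ~ N(mu, sigma^2): p(y) = p(u) * |du/dy|."""
    u = np.arctanh(y)  # invert the squashing
    gauss = np.exp(-0.5 * ((u - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return gauss / (1.0 - y ** 2)  # Jacobian term |du/dy| = 1 / (1 - y^2)


def grid_mode(mu, sigma, n=200001):
    """Locate the mode of the squashed density by dense grid search on (-1, 1)."""
    y = np.linspace(-1.0, 1.0, n)[1:-1]  # drop the endpoints, where arctanh diverges
    return float(y[np.argmax(tanh_gaussian_pdf(y, mu, sigma))])


# Illustrative (assumed) policy parameters for a single action dimension.
mu, sigma = 0.5, 0.5
naive_action = float(np.tanh(mu))   # action obtained by squashing the Gaussian mean
mode_action = grid_mode(mu, sigma)  # most probable action after squashing

print(f"tanh(mu)        = {naive_action:.4f}")
print(f"mode of p(y)    = {mode_action:.4f}")
print(f"gap (mode-tanh) = {mode_action - naive_action:.4f}")
```

With these assumed parameters the Jacobian term pushes probability mass toward the boundary, so the grid-search mode lies above $\tanh(\mu)$. Grid search is used only because it is easy to verify in one dimension; it does not scale to high-dimensional action spaces.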
Abstract
The Soft Actor-Critic (SAC) algorithm is widely recognized for its robust performance across a range of deep reinforcement learning tasks, where it uses the tanh transformation to constrain actions within bounded limits. However, this transformation induces a distribution shift: it distorts the original Gaussian action distribution and can lead the policy to select suboptimal actions, particularly in high-dimensional action spaces. In this paper, we conduct a theoretical and empirical analysis of this distribution shift, deriving the exact probability density function (PDF) of actions after the tanh transformation to clarify the misalignment between the transformed distribution's mode and the intended action output. We substantiate these theoretical insights through extensive experiments on high-dimensional tasks in the HumanoidBench benchmark. Our findings indicate that accounting for this distribution shift substantially enhances SAC's performance, yielding notable improvements in cumulative rewards, sample efficiency, and reliability across tasks. These results underscore a critical consideration for SAC and similar algorithms: addressing transformation-induced distribution shifts is essential to policy effectiveness in high-dimensional deep reinforcement learning environments, and doing so extends the robustness and applicability of SAC in complex control tasks.
