FastDSAC: Unlocking the Potential of Maximum Entropy RL in High-Dimensional Humanoid Control

Jun Xue, Junze Wang, Xinming Zhang, Shanze Wang, Yanjun Chen, Wei Zhang

Abstract

Scaling Maximum Entropy Reinforcement Learning (RL) to high-dimensional humanoid control remains a formidable challenge, as the "curse of dimensionality" induces severe exploration inefficiency and training instability in expansive action spaces. Consequently, recent high-throughput paradigms have largely converged on deterministic policy gradients combined with massive parallel simulation. We challenge this compromise with FastDSAC, a framework that effectively unlocks the potential of maximum entropy stochastic policies for complex continuous control. We introduce Dimension-wise Entropy Modulation (DEM) to dynamically redistribute the exploration budget and enforce diversity, alongside a continuous distributional critic tailored to ensure value fidelity and mitigate high-dimensional value overestimation. Extensive evaluations on HumanoidBench and other continuous control tasks demonstrate that rigorously designed stochastic policies can consistently match or outperform deterministic baselines, achieving notable gains of 180% and 400% on the challenging Basketball and Balance Hard tasks.

Paper Structure (19 sections, 9 equations, 11 figures, 1 table)

Figures (11)

  • Figure 1: Performance of FastDSAC on high-dimensional humanoid control tasks. (a) Visualizations of the challenging Basketball and Balance Hard environments. (b, c) Evaluation curves comparing FastDSAC with the FastTD3 baseline. FastDSAC significantly outperforms FastTD3, achieving final returns 1.8× and 4.0× higher on the respective tasks with superior sample efficiency.
  • Figure 2: Overview of the FastDSAC architecture. (Left) Actor with DEM: The policy dynamically redistributes the exploration budget by modulating the base standard deviation $\hat{\sigma}(s)$ with weights $w_i$ (via element-wise multiplication $\odot$). These weights are derived from logits $l(s)$ and an environment-conditioned heterogeneity factor $\beta_e$ using a normalized Softmax. (Middle) Environment: Massively parallel environments collect uncorrelated experiences into a shared Replay Buffer for high-throughput training. (Right) Continuous Critic: The critic models the value function as a continuous Gaussian distribution to mitigate overestimation. It minimizes the KL Divergence between the current and target distributions, avoiding the quantization artifacts of discrete baselines (marked by $\times$).
  • Figure 3: Comparative evaluation on high-dimensional continuous control benchmarks. Learning curves across selected tasks from HumanoidBench (top two rows), IsaacLab (third row), and MuJoCo Playground (bottom row). FastDSAC consistently matches or outperforms FastTD3 and FastSAC (Standard). Notably, FastDSAC excels in precision-demanding (e.g., Basketball, Insert) and stability-critical (e.g., Balance Hard) tasks, while maintaining robust dexterous manipulation and rough-terrain locomotion performance.
  • Figure 4: Ablation study of DEM. Learning curves on HumanoidBench tasks (Balance Hard, Basketball, Hurdle) comparing FastDSAC (w/ vs. w/o DEM) against FastTD3 and FastSAC baselines.
  • Figure 5: Qualitative comparison on Basketball. FastDSAC (Top) coordinates the throw successfully while maintaining stability, whereas FastTD3 (Bottom) loses balance and fails.
  • ...and 6 more figures
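
The DEM and continuous-critic descriptions in the Figure 2 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`dem_weights`, `modulate_std`, `gaussian_kl`), the way the heterogeneity factor scales the logits, and the rescaling of the softmax so that the weights average to 1 are all assumptions made for illustration, since the caption only states that the weights come from the logits and $\beta_e$ through a normalized Softmax. The KL term is the standard closed form for two univariate Gaussians; the direction used in the paper is not stated in the caption.

```python
import math


def dem_weights(logits, beta):
    """Per-dimension weights from a softmax over beta-scaled logits.

    Assumption: "normalized" is read here as rescaling the softmax output
    by the number of dimensions, so the weights average to 1.
    """
    scaled = [beta * l for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    d = len(logits)
    return [d * e / total for e in exps]


def modulate_std(base_std, logits, beta):
    """Element-wise product of the base standard deviation and the weights."""
    weights = dem_weights(logits, beta)
    return [s * w for s, w in zip(base_std, weights)]


def gaussian_kl(mu_p, sigma_p, mu_q, sigma_q):
    """Closed-form KL(N(mu_p, sigma_p^2) || N(mu_q, sigma_q^2))."""
    return (
        math.log(sigma_q / sigma_p)
        + (sigma_p ** 2 + (mu_p - mu_q) ** 2) / (2.0 * sigma_q ** 2)
        - 0.5
    )
```

Under these assumptions, setting `beta` to 0 gives every weight the value 1, so the base standard deviation is left unchanged. A larger `beta` shifts weight toward dimensions with larger logits while the sum of the weights stays fixed at the number of action dimensions.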