FlashSAC: Fast and Stable Off-Policy Reinforcement Learning for High-Dimensional Robot Control

Donghu Kim, Youngdo Lee, Minho Park, Kinam Kim, I Made Aswin Nahendra, Takuma Seno, Sehee Min, Daniel Palenicek, Florian Vogt, Danica Kragic, Jan Peters, Jaegul Choo, Hojoon Lee

Abstract

Reinforcement learning (RL) is a core approach for robot control when expert demonstrations are unavailable. On-policy methods such as Proximal Policy Optimization (PPO) are widely used for their stability, but their reliance on narrowly distributed on-policy data limits accurate policy evaluation in high-dimensional state and action spaces. Off-policy methods can overcome this limitation by learning from a broader state-action distribution, yet suffer from slow convergence and instability, as fitting a value function over diverse data requires many gradient updates, causing critic errors to accumulate through bootstrapping. We present FlashSAC, a fast and stable off-policy RL algorithm built on Soft Actor-Critic. Motivated by scaling laws observed in supervised learning, FlashSAC sharply reduces gradient updates while compensating with larger models and higher data throughput. To maintain stability at increased scale, FlashSAC explicitly bounds weight, feature, and gradient norms, curbing critic error accumulation. Across over 60 tasks in 10 simulators, FlashSAC consistently outperforms PPO and strong off-policy baselines in both final performance and training efficiency, with the largest gains on high-dimensional tasks such as dexterous manipulation. In sim-to-real humanoid locomotion, FlashSAC reduces training time from hours to minutes, demonstrating the promise of off-policy RL for sim-to-real transfer.
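The abstract names three explicit bounds, on weight, feature, and gradient norms, as the stabilizers that let the critic tolerate larger models and far fewer updates. The sketch below illustrates one way such bounds can be realized in a SAC-style critic update; it is a minimal illustration under assumptions, not the paper's implementation: BoundedCritic, clip_weight_norm, and every numeric threshold are hypothetical, and nn.RMSNorm requires PyTorch 2.4 or later.

    import torch
    import torch.nn as nn

    def clip_weight_norm(module: nn.Module, max_norm: float) -> None:
        """Hypothetical weight bound: after each optimizer step, project any
        weight matrix whose L2 norm exceeds max_norm back onto the norm ball."""
        with torch.no_grad():
            for p in module.parameters():
                if p.dim() >= 2:
                    norm = p.norm()
                    if norm > max_norm:
                        p.mul_(max_norm / norm)

    class BoundedCritic(nn.Module):
        """Q-network whose penultimate features are RMS-normalized, bounding
        the feature scale that enters the output head."""
        def __init__(self, obs_dim: int, act_dim: int, hidden: int = 1024):
            super().__init__()
            self.body = nn.Sequential(
                nn.Linear(obs_dim + act_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
            )
            self.feature_norm = nn.RMSNorm(hidden)   # feature bound
            self.head = nn.Linear(hidden, 1)

        def forward(self, obs: torch.Tensor, act: torch.Tensor) -> torch.Tensor:
            h = self.body(torch.cat([obs, act], dim=-1))
            return self.head(self.feature_norm(h))

    # One TD update with all three bounds applied (dimensions and the target
    # below are dummies standing in for a real batch and Bellman target).
    critic = BoundedCritic(obs_dim=48, act_dim=12)
    opt = torch.optim.AdamW(critic.parameters(), lr=3e-4)
    obs, act = torch.randn(256, 48), torch.randn(256, 12)
    td_target = torch.randn(256, 1)

    loss = (critic(obs, act) - td_target).pow(2).mean()
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(critic.parameters(), max_norm=10.0)  # gradient bound
    opt.step()
    clip_weight_norm(critic, max_norm=2.0)                              # weight bound

The ordering matters: the gradient clip acts before the step, while the weight projection acts after it, so the parameters respect their bound at the start of every update.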

Figures (36)

  • Figure 1: FlashSAC Architecture. The architecture consists of stacked inverted residual blocks with pre-activation batch normalization and post-RMS normalization (a sketch of such a block appears after this list).
  • Figure 2: Results on State-Based RL, GPU-based Simulators. Learning curves on select tasks from IsaacLab [mittal2025isaaclab], ManiSkill [tao2024maniskill3], Genesis [genesis2025Genesis], and MuJoCo Playground [zakka2025mujoco]. We evaluate performance on (a) low-dimensional tasks with gripper manipulation and quadruped locomotion, and (b) high-dimensional tasks involving dexterous manipulation and humanoid locomotion. FlashSAC is comparable to PPO [schwarke2025rslrl] on low-dimensional tasks and significantly outperforms it on high-dimensional tasks.
  • Figure 3: Results on State-Based RL, CPU-based Simulators. Learning curves on select tasks from MuJoCo [towers2024gymnasium, todorov2012mujoco], DMC [tassa2018dmc], HumanoidBench [sferrazza2024humanoidbench], and MyoSuite [caggiano2022myosuite]. We primarily evaluate high-dimensional tasks involving dexterous manipulation and humanoid locomotion. FlashSAC significantly outperforms PPO, as well as strong off-policy and model-based RL baselines, in both compute efficiency and asymptotic performance.
  • Figure 4: Results on Vision-Based RL. Learning curves on selected tasks from the vision-based DMControl Suite [tassa2018dmc]. We assess learning performance in low-dimensional environments, including pendulum manipulation and bipedal locomotion. FlashSAC achieves better compute efficiency and higher asymptotic performance.
  • Figure 5: Sim-to-Real Stair Climbing on Unitree G1. FlashSAC achieves stable real-world stair climbing after only 4 hours of training in simulation, whereas PPO requires nearly 20 hours to reach the same capability.
  • ...and 31 more figures
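The Figure 1 caption describes the network as stacked inverted residual blocks with pre-activation batch normalization followed by a post-RMS normalization. The sketch below is one plausible reading of that caption, not the paper's code: the expand-then-project MLP design, expansion factor, depth, width, and ReLU activation are all assumptions, FlashSACTrunk is a hypothetical name, and nn.RMSNorm requires PyTorch 2.4 or later.

    import torch
    import torch.nn as nn

    class InvertedResidualBlock(nn.Module):
        """Inverted residual MLP block: normalize, expand, activate, project,
        then add the skip connection. The leading BatchNorm matches the
        caption's "pre-activation batch normalization"; the expansion factor
        of 4 is an assumption for this sketch."""
        def __init__(self, dim: int, expansion: int = 4):
            super().__init__()
            self.norm = nn.BatchNorm1d(dim)               # pre-activation batch norm
            self.expand = nn.Linear(dim, dim * expansion)
            self.act = nn.ReLU()
            self.project = nn.Linear(dim * expansion, dim)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return x + self.project(self.act(self.expand(self.norm(x))))

    class FlashSACTrunk(nn.Module):
        """Stacked blocks with a final post-RMS normalization, as described in
        Figure 1. Depth and width are placeholders, not values from the paper."""
        def __init__(self, in_dim: int, dim: int = 512, depth: int = 3):
            super().__init__()
            self.stem = nn.Linear(in_dim, dim)
            self.blocks = nn.Sequential(
                *(InvertedResidualBlock(dim) for _ in range(depth))
            )
            self.post_norm = nn.RMSNorm(dim)              # post-RMS normalization

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            return self.post_norm(self.blocks(self.stem(x)))

    # Example: a trunk over a 48-dimensional proprioceptive observation.
    trunk = FlashSACTrunk(in_dim=48)
    features = trunk(torch.randn(256, 48))                # -> shape (256, 512)

Placing the residual add after the projection keeps the skip path identity-shaped, which is the usual reason inverted residual designs scale gracefully with width.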