Keep on Going: Learning Robust Humanoid Motion Skills via Selective Adversarial Training

Yang Zhang, Zhanxiang Cao, Buqing Nie, Haoyang Li, Zhong Jiangwei, Qiao Sun, Xiaoyi Hu, Xiaokang Yang, Yue Gao

TL;DR

Humanoid motion policies trained with RL often struggle to sustain stability over long horizons under real-world disturbances. The authors propose SA2RT, a selective adversarial training framework where a learnable Selective Attack Policy targets vulnerable states under an attack-budget constraint, and a non-zero-sum adversarial training regime ensures continual improvement for both attacker and motion policies. Through alternating optimization, SA2RT yields robust perceptive locomotion and whole-body control on a Unitree G1, with substantial gains in complex terrain traversal and long-horizon trajectory tracking, including real-world demonstrations that outperform domain randomization baselines. The results indicate that targeted, budget-limited perturbations are an effective driver for learning resilient, long-horizon humanoid skills that transfer from simulation to reality, reducing failure modes caused by sensor/actuator noise and environmental disturbances. SA2RT also provides interpretable insights into how vulnerability concentrates at terrain transitions and how hyperparameters like $\lambda$ control attack frequency and robustness.

Abstract

Humanoid robots are expected to operate reliably over long horizons while executing versatile whole-body skills. Yet Reinforcement Learning (RL) motion policies typically lose stability under prolonged operation, sensor/actuator noise, and real-world disturbances. In this work, we propose Selective Adversarial Attack for Robust Training (SA2RT) to enhance the robustness of motion skills. The adversary learns to identify and sparsely perturb the most vulnerable states and actions under an attack-budget constraint, thereby exposing true weaknesses without inducing conservative overfitting. The resulting non-zero-sum, alternating optimization continually strengthens the motion policy against the strongest discovered attacks. We validate our approach on the Unitree G1 humanoid robot across perceptive locomotion and whole-body control tasks. Experimental results show that adversarially trained policies improve the terrain traversal success rate by 40%, reduce the trajectory tracking error by 32%, and maintain long-horizon mobility and tracking performance. Together, these results demonstrate that selective adversarial attacks are an effective driver for learning robust, long-horizon humanoid motion skills.
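To make the budget idea concrete, here is a minimal Python sketch of how a per-attack cost could make an attacker selective. It is not the paper's formulation: it assumes that $\lambda$ acts as a fixed cost paid for each attacked step, and the names `select_attacks`, `attacker_return`, and `attack_ratio` are hypothetical. Under that assumption the attacker's return is not simply the negative of the motion policy's return, which is one way a game can become non-zero-sum.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    motion_reward: float  # reward the motion policy received at this step
    attacked: bool        # whether the attacker perturbed this step


def select_attacks(estimated_gains: List[float], lam: float) -> List[bool]:
    """Attack only steps whose estimated reward drop exceeds the cost lam.

    ASSUMPTION: lam is a per-attack cost; the paper's actual budget
    constraint may take a different form.
    """
    return [gain > lam for gain in estimated_gains]


def attacker_return(steps: List[Step], lam: float) -> float:
    """Hypothetical attacker objective: hurt the motion policy, pay per attack."""
    damage = -sum(s.motion_reward for s in steps)
    cost = lam * sum(1 for s in steps if s.attacked)
    return damage - cost


def attack_ratio(steps: List[Step]) -> float:
    """Fraction of steps that were attacked (0.0 for an empty rollout)."""
    if not steps:
        return 0.0
    return sum(1 for s in steps if s.attacked) / len(steps)


# Toy rollout: estimated reward drop if each of four steps were attacked.
gains = [0.1, 0.9, 0.3, 1.5]
mask = select_attacks(gains, lam=0.5)
rollout = [Step(motion_reward=1.0, attacked=a) for a in mask]
```

In this toy setting, raising `lam` makes attacks rarer, and a large enough value suppresses them entirely. That matches the summary's remark that $\lambda$ controls attack frequency, though the real mechanism is learned rather than thresholded.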

Paper Structure

This paper contains 39 sections, 10 equations, 10 figures, 12 tables, 1 algorithm.

Figures (10)

  • Figure 1: Snapshots of the humanoid robot executing whole-body trajectory tracking. WBC-SAP can track challenging dynamic trajectories over an extended duration, demonstrating that SA2RT significantly improves the robustness of motion policies.
  • Figure 2: Overview of SA2RT. The SAP identifies vulnerabilities in motion states and generates adversarial samples by applying perturbations in both the state and action spaces. Through alternating adversarial training under a non-zero-sum game, the motion policy continuously addresses its own vulnerabilities using adversarial samples, enhancing its robustness against perturbations. The resulting robust motion skills are deployed to real robots without the SAP, enabling robust whole-body motion control for humanoid robots.
  • Figure 3: Performance analysis of whole-body control. Trajectory tracking errors of WBC-DR and WBC-SAP are evaluated in clean environments and DR environments. WBC-SAP outperforms WBC-DR across all evaluation metrics, demonstrating that SA2RT effectively enhances the robustness and tracking performance of WBC policies.
  • Figure 4: Impact of different attack policies on motion policy performance. (a) Rewards of motion policies learned via different attack policies under varying perturbation levels $L_{p}$. (b) Performance comparison in unperturbed environments for motion policies trained under different $L_{p}$.
  • Figure 5: SAP's attack ratio varies significantly across motion tasks. As task difficulty increases, the attack ratio gradually increases.
  • ...and 5 more figures