TDMPBC: Self-Imitative Reinforcement Learning for Humanoid Robot Control

Zifeng Zhuang, Diyuan Shi, Runze Suo, Xiao He, Hongyin Zhang, Ting Wang, Shangke Lyu, Donglin Wang

TL;DR

The paper tackles the sample-efficiency challenge of reinforcement learning for humanoid robots with dexterous hands, whose state and action spaces are high-dimensional. It introduces Self-Imitative Reinforcement Learning (SIRL), which augments a model-based RL method (TD-MPC2) with a self-imitation (behavior cloning) term in the policy loss. The imitation weight is a function of the trajectory return $R_t$ and a reference $G$, so the agent prioritizes data from high-return regions such as upright postures, which are critical for downstream tasks. Empirically, TDMPBC (TD-MPC2 plus BC) achieves about a 120% improvement in normalized return on HumanoidBench with only ~5% extra computation and solves 8 of 14 locomotion tasks at 2M steps, although simultaneous whole-body manipulation remains challenging. The work suggests that online imitation of self-generated high-return trajectories can substantially boost sample efficiency, with learning to stay upright serving as a foundation for more capable humanoid control, and it points to real-world deployment and further manipulation tasks as future directions.

Abstract

Complex high-dimensional spaces with many degrees of freedom and complicated action spaces, such as humanoid robots equipped with dexterous hands, pose significant challenges for reinforcement learning (RL) algorithms, which must carefully balance exploration and exploitation under limited sample budgets. In general, the feasible regions for accomplishing tasks within complex high-dimensional spaces are exceedingly narrow. For instance, in humanoid robot motion control, the vast majority of the space corresponds to falling, while only a minuscule fraction corresponds to standing upright, which is conducive to completing downstream tasks. Once the robot explores into a potentially task-relevant region, it should place greater emphasis on the data within that region. Building on this insight, we propose the $\textbf{S}$elf-$\textbf{I}$mitative $\textbf{R}$einforcement $\textbf{L}$earning ($\textbf{SIRL}$) framework, where the RL algorithm also imitates potentially task-relevant trajectories. Specifically, the trajectory return is used to determine a trajectory's relevance to the task, and an additional behavior cloning term is adopted whose weight is dynamically adjusted based on the trajectory return. As a result, our proposed algorithm achieves a 120% performance improvement on the challenging HumanoidBench with 5% extra computation overhead. With further visualization, we find that the significant performance gain does lead to meaningful behavior improvement, in that several tasks are solved successfully.
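
The return-weighted behavior cloning idea described above can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the linear clipped weight `beta * clip(R / G, 0, 1)`, and the squared-error cloning term are all assumptions chosen for illustration, and the sketch assumes a positive reference return `G`. It only shows the general shape of adding a cloning term whose weight grows with the return of the trajectory a sample came from.

```python
import numpy as np


def imitation_weight(trajectory_return, reference_return, beta=1.0):
    """Hypothetical weight: scales with how close the return R is to the reference G.

    Assumes reference_return > 0. The ratio is clipped to [0, 1], so
    low-return (e.g. falling) trajectories contribute little to imitation.
    """
    ratio = np.clip(np.asarray(trajectory_return) / reference_return, 0.0, 1.0)
    return beta * ratio


def self_imitative_policy_loss(rl_loss, policy_actions, buffer_actions,
                               trajectory_returns, reference_return, beta=1.0):
    """Base RL policy loss plus a return-weighted behavior cloning term (sketch)."""
    # Per-sample squared error between the policy's actions and the actions
    # that were actually taken and stored in the replay buffer.
    bc_error = np.mean((policy_actions - buffer_actions) ** 2, axis=-1)
    weights = imitation_weight(trajectory_returns, reference_return, beta)
    return rl_loss + np.mean(weights * bc_error)


# Usage: three samples drawn from trajectories with returns 0, 50 and 100.
policy_actions = np.zeros((3, 2))
buffer_actions = np.ones((3, 2))
returns = np.array([0.0, 50.0, 100.0])
loss = self_imitative_policy_loss(2.0, policy_actions, buffer_actions,
                                  returns, reference_return=100.0)
```

In this sketch, samples from the highest-return trajectory are imitated at full strength while samples from the zero-return trajectory are ignored by the cloning term, so the base RL objective still drives learning early on.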

Paper Structure

This paper contains 30 sections, 9 equations, 11 figures, 2 tables.

Figures (11)

  • Figure 1: Tasks accomplished by TDMPBC: 1) navigating through pole-filled areas by staying close to the wall, 2) maintaining balance on an unstable board while the spherical pivot beneath the board is in motion, 3) cleaning a window with arm-controlled cleaning tools and 4) achieving a successful basketball shot.
  • Figure 2: Performance of TDMPBC with 2M interaction steps compared to the baselines TD-MPC2 with 2M, DreamerV3 with 10M and SAC with 10M on HumanoidBench.
  • Figure 3: The first row presents the mean return of the trajectories obtained by evaluating the policy trained for 100,000 steps on the run task, along with the violin plot distribution of $r_{\text{upright}}$ across all timesteps. The second row shows the results obtained after training for 300,000 steps.
  • Figure 4: This figure presents the evaluation results on HumanoidBench. Experiments use three seeds, and the shaded area represents one standard deviation. The baseline results are taken directly from HumanoidBench.
  • Figure 5: The left panels illustrate the impact of different hyperparameter values ($\beta = 0.5, 1.0, 2.0$) on the performance of TDMPBC across three tasks: run, hurdle, and maze. The right panels show the effects of two different goal settings ($G = R_{\text{max}}$ and $G = R_{\text{target}}$) on the performance of TDMPBC across three tasks: reach, hurdle, and maze.
  • ...and 6 more figures