An Efficient Model-Based Approach on Learning Agile Motor Skills without Reinforcement

Haojie Shi, Tingguang Li, Qingxu Zhu, Jiapeng Sheng, Lei Han, Max Q. -H. Meng

TL;DR

This work tackles the sim-to-real gap and data inefficiency in quadrupedal locomotion by learning a differentiable world model that predicts future states and using it to supervise a VAE-based policy that imitates real animal trajectories. The method uses a two-stage, supervised training regime: first learn a predictive world model with $n$-step dynamics, then train a motion-tracking policy and a command-following latent space via structured VAEs, with end-to-end backpropagation enabled by the differentiable dynamics. Real-world fine-tuning adds a regularization term to preserve prior behavior, enabling rapid adaptation from about two minutes of data and demonstrating robust generalization to unseen speeds and paths. Experiments show more than a tenfold improvement in sample efficiency over PPO in simulation and effective two-minute adaptation on a real quadruped, which matters in practice for deploying agile motor skills with minimal real-world data. The work suggests future extensions to perception-enhanced world models for visual locomotion and further reductions in the sim-to-real gap.
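
The two-stage recipe summarized above can be illustrated with a toy sketch. It is not the authors' implementation: the scalar dynamics, the proportional-gain policy, the ramp reference, and every function name are assumptions made for illustration. The gradient is also written out by hand in forward mode, where the paper's framework would rely on backpropagation through neural networks and a VAE-based policy.

```python
import math

# --- Stage 1: fit a toy world model s' = a*s + b*u from logged transitions ---
TRUE_A, TRUE_B = 0.9, 0.5  # hypothetical "real" dynamics, unknown to the learner


def collect_transitions(num=20):
    """Log (state, action, next_state) triples from the toy system."""
    data = []
    for i in range(num):
        s, u = math.sin(i), math.cos(2 * i)
        data.append((s, u, TRUE_A * s + TRUE_B * u))
    return data


def fit_world_model(data):
    """Least-squares fit of (a, b) by solving the 2x2 normal equations."""
    ss = sum(s * s for s, _, _ in data)
    su = sum(s * u for s, u, _ in data)
    uu = sum(u * u for _, u, _ in data)
    sy = sum(s * y for s, _, y in data)
    uy = sum(u * y for _, u, y in data)
    det = ss * uu - su * su
    return (sy * uu - uy * su) / det, (uy * ss - sy * su) / det


# --- Stage 2: train a policy by differentiating through an n-step rollout ---
def rollout_loss_and_grad(k, a, b, reference):
    """n-step tracking loss under the model and its derivative w.r.t. gain k."""
    s, ds_dk = reference[0], 0.0
    loss, grad = 0.0, 0.0
    for t in range(len(reference) - 1):
        err = reference[t] - s
        u = k * err                    # assumed policy: proportional tracking gain
        du_dk = err - k * ds_dk        # chain rule through the policy
        s = a * s + b * u              # world-model prediction of the next state
        ds_dk = a * ds_dk + b * du_dk  # chain rule through the dynamics
        diff = s - reference[t + 1]
        loss += diff * diff
        grad += 2.0 * diff * ds_dk
    return loss, grad


def train_policy(a, b, reference, lr=0.02, steps=200):
    """Plain gradient descent on the gain, supervised by the world model."""
    k, losses = 0.0, []
    for _ in range(steps):
        loss, grad = rollout_loss_and_grad(k, a, b, reference)
        losses.append(loss)
        k -= lr * grad
    return k, losses


a_hat, b_hat = fit_world_model(collect_transitions())
reference = [0.1 * t for t in range(11)]  # stand-in for a reference motion, n = 10
gain, losses = train_policy(a_hat, b_hat, reference)
print(f"fitted model: a={a_hat:.3f}, b={b_hat:.3f}")
print(f"tracking loss: {losses[0]:.3f} -> {losses[-1]:.3f} (gain {gain:.3f})")
```

No reward signal appears in the sketch: the policy parameter is updated directly from the derivative of a supervised tracking loss, which is available only because the learned model is differentiable.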

Abstract

Learning-based methods have improved locomotion skills of quadruped robots through deep reinforcement learning. However, the sim-to-real gap and low sample efficiency still limit skill transfer. To address this issue, we propose an efficient model-based learning framework that combines a world model with a policy network. We train a differentiable world model to predict future states and use it to directly supervise a Variational Autoencoder (VAE)-based policy network to imitate real animal behaviors. This significantly reduces the need for real interaction data and allows for rapid policy updates. We also develop a high-level network to track diverse commands and trajectories. Our simulation results show a tenfold sample efficiency increase compared to reinforcement learning methods such as PPO. In real-world testing, our policy achieves proficient command-following performance with only a two-minute data collection period and generalizes well to new speeds and paths.
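
The summary mentions that real-world fine-tuning uses a regularization term to preserve prior behavior, but its exact form is not given here. The sketch below assumes one common choice, an L2 penalty on how far the parameters drift from their pretrained values. The linear policy, the four-sample dataset, and all names are hypothetical.

```python
def fine_tune(theta_prior, real_data, reg_weight, lr=0.05, steps=500):
    """Gradient descent on mean squared error plus an L2 pull toward the prior."""
    theta = list(theta_prior)
    n = len(real_data)
    for _ in range(steps):
        grad = [0.0] * len(theta)
        for x, y in real_data:
            pred = sum(w * xi for w, xi in zip(theta, x))
            for j, xj in enumerate(x):
                grad[j] += 2.0 * (pred - y) * xj / n
        for j in range(len(theta)):
            # Assumed regularizer: reg_weight * ||theta - theta_prior||^2
            grad[j] += 2.0 * reg_weight * (theta[j] - theta_prior[j])
            theta[j] -= lr * grad[j]
    return theta


def drift(theta, ref):
    """Euclidean distance between two parameter vectors."""
    return sum((a - b) ** 2 for a, b in zip(theta, ref)) ** 0.5


theta_prior = [1.0, 0.0]   # hypothetical parameters pretrained in simulation
theta_real = [1.3, -0.2]   # hypothetical parameters that fit the real robot
features = [(1.0, 0.5), (0.5, -1.0), (-1.0, 1.0), (2.0, 0.0)]
real_data = [(x, theta_real[0] * x[0] + theta_real[1] * x[1]) for x in features]

unregularized = fine_tune(theta_prior, real_data, reg_weight=0.0)
regularized = fine_tune(theta_prior, real_data, reg_weight=1.0)
print("drift from prior, no regularizer:", drift(unregularized, theta_prior))
print("drift from prior, with regularizer:", drift(regularized, theta_prior))
```

With the penalty switched on, the fine-tuned parameters move toward the values that fit the small real dataset but stay closer to the pretrained ones, which is the trade-off such a term is meant to control when only a short data collection is available.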

Paper Structure (11 sections, 10 equations, 7 figures, 3 tables, 1 algorithm)

This paper contains 11 sections, 10 equations, 7 figures, 3 tables, 1 algorithm.

Figures (7)

  • Figure 1: Our robot Max follows the U-shaped path after being fine-tuned in the real world.
  • Figure 2: Overview of our learning framework. The gray block represents fixed parameters. For the command following task, the Motor Decoder is fixed when training from scratch and becomes trainable during real-world fine-tuning.
  • Figure 3: Four types of desired paths. The red star represents the starting point.
  • Figure 4: (a) Training curves of the motion tracking task in the simulation. (b) Training curves of fine-tuning the motion tracking task policy in the modified simulation environment. (c) Mean loss of fine-tuning the path following policy in three workloads within the modified simulation environment. (d) Mean loss of fine-tuning the path following policy under various speeds on the real robot.
  • Figure 5: Speed following at 1.2 m/s along the oblong path on the real robot with real-world adaptation.
  • ...and 2 more figures