An Efficient Model-Based Approach on Learning Agile Motor Skills without Reinforcement
Haojie Shi, Tingguang Li, Qingxu Zhu, Jiapeng Sheng, Lei Han, Max Q.-H. Meng
TL;DR
This work tackles the sim-to-real gap and data inefficiency in quadrupedal locomotion by learning a differentiable world model that predicts future states and using it to supervise a VAE-based policy that imitates real animal trajectories. Training is fully supervised and proceeds in two stages: first learn a predictive world model with an $n$-step dynamics objective, then train a motion-tracking policy and a command-following latent space via structured VAEs, backpropagating end to end through the differentiable dynamics. Real-world fine-tuning adds a regularization term that preserves prior behavior, which enables adaptation from about two minutes of data and generalization to unseen speeds and paths. Experiments show more than a tenfold improvement in sample efficiency over PPO in simulation and effective two-minute adaptation on a real quadruped, which matters for deploying agile motor skills with minimal real-world data. The work suggests extending the approach to perception-enhanced world models for visual locomotion and further reducing the sim-to-real gap.
Abstract
Learning-based methods have improved the locomotion skills of quadruped robots through deep reinforcement learning. However, the sim-to-real gap and low sample efficiency still limit skill transfer. To address these issues, we propose an efficient model-based learning framework that combines a world model with a policy network. We train a differentiable world model to predict future states and use it to directly supervise a Variational Autoencoder (VAE)-based policy network to imitate real animal behaviors. This significantly reduces the need for real interaction data and allows for rapid policy updates. We also develop a high-level network to track diverse commands and trajectories. Our simulation results show a tenfold increase in sample efficiency compared to reinforcement learning methods such as PPO. In real-world testing, our policy achieves proficient command-following performance with only a two-minute data collection period and generalizes well to new speeds and paths.
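
To make the idea of training a world model "to predict future states" concrete, the following is a minimal sketch of an autoregressive multi-step prediction loss. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the model architecture or loss, so the toy linear dynamics, the mean-squared-error objective, and all names (`n_step_rollout_loss`, `true_step`, `biased_step`) are hypothetical.

```python
import numpy as np


def n_step_rollout_loss(step_fn, states, actions):
    """Mean squared error of an autoregressive n-step rollout.

    step_fn : callable (state, action) -> predicted next state (hypothetical model)
    states  : array of shape (n + 1, state_dim), a recorded trajectory
    actions : array of shape (n, action_dim), the actions taken along it
    """
    pred = states[0]  # start from the first recorded state
    errors = []
    for t in range(len(actions)):
        # Feed the model its own previous prediction, not the recorded state,
        # so that errors compound over the horizon as they would at test time.
        pred = step_fn(pred, actions[t])
        errors.append(np.mean((pred - states[t + 1]) ** 2))
    return float(np.mean(errors))


# Toy linear dynamics standing in for the robot (assumption for illustration).
rng = np.random.default_rng(0)
A_true = 0.9 * np.eye(3)
B_true = 0.1 * rng.normal(size=(3, 2))


def true_step(s, a):
    return A_true @ s + B_true @ a


# Record a 5-step trajectory from the toy dynamics.
actions = rng.normal(size=(5, 2))
trajectory = [rng.normal(size=3)]
for a in actions:
    trajectory.append(true_step(trajectory[-1], a))
states = np.stack(trajectory)


def biased_step(s, a):
    # A deliberately wrong model: the state decay is 0.8 instead of 0.9.
    return 0.8 * s + B_true @ a


loss_exact = n_step_rollout_loss(true_step, states, actions)
loss_biased = n_step_rollout_loss(biased_step, states, actions)
```

A model that matches the data-generating dynamics yields zero rollout loss, while the biased model accumulates error over the horizon. In a framework like the one described, the step function would be a learned differentiable network, which is what allows gradients of a tracking loss to flow back through the rollout into the policy.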
