Learning Quadruped Walking from Seconds of Demonstration

Ruipeng Zhang, Hongzhan Yu, Ya-Chien Chang, Chenghao Li, Henrik I. Christensen, Sicun Gao

TL;DR

A principled analysis of why imitation learning with quadrupeds can be inherently effective in a small data regime, based on the structure of its limit cycles, Poincaré return maps, and local numerical properties of neural networks.

Abstract

Quadruped locomotion provides a natural setting for understanding when model-free learning can outperform model-based control design, by exploiting data patterns to bypass the difficulty of optimizing over discrete contacts and the combinatorial explosion of mode changes. We give a principled analysis of why imitation learning with quadrupeds can be inherently effective in a small data regime, based on the structure of its limit cycles, Poincaré return maps, and local numerical properties of neural networks. The understanding motivates a new imitation learning method that regulates the alignment between variations in a latent space and those over the output actions. Hardware experiments confirm that a few seconds of demonstration is sufficient to train various locomotion policies from scratch entirely offline with reasonable robustness.

Paper Structure (17 sections, 18 equations, 7 figures)


Figures (7)

  • Figure 1: Overview of our approach. We consider offline imitation learning: we collect a small batch of expert demonstration data and then train deep neural network policies from that batch alone, without fine-tuning in simulation or on hardware. We propose new imitation methods with Latent Variation Regularization (LVR), which enforces a match between the local structure of the control problem and that of the neural networks. The trained policies are tested directly on hardware platforms with varying ground conditions.
  • Figure 2: Illustration of the first-order variation requirement in neural policies. Around expert trajectories of stable walking, the local stabilizing control laws are linear at both continuous states (analyzed through trajectory stabilization) and discrete jump states (analyzed through Poincaré sections). This structure matches the local smooth pieces in deep neural networks, which are approximately independent because of sparsity in the large parameter space. Thus the local feedback requirements can be readily enforced through imitation learning that regularizes local variations in the latent space.
  • Figure 3: Real-world deployment of quadruped policies trained with minimal expert data. The upper left shows expert demonstration. Using the same data, behavior cloning (bottom left) fails to walk, whereas our Latent Variation Regularization (LVR, right column) produces stable forward and sideways walking. LVR further shows robust performance, such as walking backwards on grass by training on data from walking backwards on flat indoor ground.
  • Figure 4: Left: training pointwise imitation loss, where both BC and LVR rapidly converge to similar values. Right: control performance on forward and sideways walking with varying dataset sizes. LVR achieves expert-level performance with $\leq 1$ trajectory, whereas BC requires substantially more demonstrations to approach similar returns. The purple line denotes the expert policy performance.
  • Figure 5: PCA visualization of latent states $h_t$, where projection axes are obtained from PCA on consecutive differences $\delta h_t=h_{t+1}-h_t$. Points are connected in temporal order and colored by cosine similarity of $\delta h_t$ with the first principal component (PC1). Left: expert forward-walking trajectory showing structured trot dynamics with two alternating gait modes. Right: an OOD rollout where BC (top) collapses while LVR (bottom) preserves coherent orientation bundles and separates ill-conditioned late states. See details in Section \ref{sec:latent}.
  • ...and 2 more figures
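
The projection described in the Figure 5 caption can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' plotting code: the function name `latent_variation_pca`, the mean-centering choices, and the synthetic two-mode trajectory are all hypothetical. It fits PCA axes on the consecutive differences $\delta h_t=h_{t+1}-h_t$, projects the latent states $h_t$ onto those axes, and computes the cosine similarity of each $\delta h_t$ with PC1 (the quantity used for coloring).

```python
import numpy as np


def latent_variation_pca(h, n_components=2):
    """Project latent states onto PCA axes fitted on consecutive differences.

    h: array of shape (T, d) holding latent states h_t in temporal order.
    Returns (proj, cos, axes):
      proj: (T, n_components) projection of the (centered) latent states
      cos:  (T-1,) cosine similarity of each delta h_t with PC1
      axes: (n_components, d) principal axes of the differences
    """
    h = np.asarray(h, dtype=float)
    dh = np.diff(h, axis=0)  # dh[t] = h[t+1] - h[t]

    # PCA via SVD of the mean-centered differences (centering is an assumption).
    dh_centered = dh - dh.mean(axis=0)
    _, _, vt = np.linalg.svd(dh_centered, full_matrices=False)
    axes = vt[:n_components]  # rows are orthonormal principal directions

    # Project the states themselves (not the differences) for plotting.
    proj = (h - h.mean(axis=0)) @ axes.T

    # Cosine similarity of each raw difference with PC1 (axes[0] has unit norm).
    norms = np.linalg.norm(dh, axis=1)
    cos = (dh @ axes[0]) / np.maximum(norms, 1e-12)
    return proj, cos, axes


# Synthetic stand-in for a latent trajectory with two alternating modes:
# steps go along +u for 10 steps, then -u for 10 steps, repeated twice.
rng = np.random.default_rng(0)
d = 8
u = np.zeros(d)
u[0] = 1.0
signs = np.repeat([1.0, -1.0, 1.0, -1.0], 10)           # 40 steps
steps = signs[:, None] * u + 0.01 * rng.normal(size=(40, d))
h = np.vstack([np.zeros(d), np.cumsum(steps, axis=0)])  # 41 latent states

proj, cos, axes = latent_variation_pca(h)
```

In this toy setting the two modes show up as differences that are nearly parallel or anti-parallel to PC1, which is the kind of structure the caption attributes to the alternating gait modes of a trot. Plotting `proj` as a line in temporal order, colored by `cos`, would give a figure in the same style.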