Residual Policy Learning for Perceptive Quadruped Control Using Differentiable Simulation

Jing Yuan Luo, Yunlong Song, Victor Klemm, Fan Shi, Davide Scaramuzza, Marco Hutter

TL;DR

For quadruped locomotion, residual policy learning in FoPG-based training (FoPG RPL) primarily improves asymptotic rewards, whereas for model-free RL it primarily improves sample efficiency.

Abstract

First-order Policy Gradient (FoPG) algorithms such as Backpropagation through Time and Analytical Policy Gradients leverage local simulation physics to accelerate policy search, significantly improving sample efficiency in robot control compared to standard model-free reinforcement learning. However, FoPG algorithms can exhibit poor learning dynamics in contact-rich tasks like locomotion. Previous approaches address this issue by alleviating contact dynamics via algorithmic or simulation innovations. In contrast, we propose guiding the policy search by learning a residual over a simple baseline policy. For quadruped locomotion, we find that the role of residual policy learning in FoPG-based training (FoPG RPL) is primarily to improve asymptotic rewards, whereas for model-free RL it primarily improves sample efficiency. Additionally, we provide insights on applying FoPGs to pixel-based local navigation, training a point-mass robot to convergence within seconds. Finally, we showcase the versatility of FoPG RPL by using it to train locomotion and perceptive navigation end-to-end on a quadruped in minutes.
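To make "learning a residual over a simple baseline policy" concrete, the following minimal sketch shows one way a frozen baseline (anchor) policy and a learned residual might be combined. All names (`anchor_policy`, `make_residual_policy`, `combined_action`), the toy gains, and the two-dimensional observation are hypothetical and chosen purely for illustration; this is not the authors' implementation.

```python
from typing import Callable, List, Sequence

Vector = List[float]
ResidualFn = Callable[[Sequence[float], Sequence[float]], Vector]


def anchor_policy(obs: Sequence[float]) -> Vector:
    """Hypothetical frozen baseline: a fixed proportional controller."""
    return [-0.5 * o for o in obs]


def make_residual_policy(weights: Sequence[float]) -> ResidualFn:
    """Hypothetical learned residual with one trainable weight per action.

    The residual sees both the observation and the anchor action.
    """
    def residual(obs: Sequence[float], anchor_action: Sequence[float]) -> Vector:
        return [w * (o + a) for w, o, a in zip(weights, obs, anchor_action)]
    return residual


def combined_action(obs: Sequence[float], residual_policy: ResidualFn) -> Vector:
    """Final action = frozen anchor action + learned residual correction."""
    anchor = anchor_policy(obs)
    correction = residual_policy(obs, anchor)
    return [a + r for a, r in zip(anchor, correction)]


if __name__ == "__main__":
    obs = [1.0, -2.0]
    # With all-zero residual weights the robot simply follows the anchor.
    print(combined_action(obs, make_residual_policy([0.0, 0.0])))
    # Non-zero weights shift the action away from the anchor.
    print(combined_action(obs, make_residual_policy([0.1, 0.1])))
```

In this sketch only the residual weights would be trained, so the search starts from the behaviour of the anchor rather than from scratch.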

Paper Structure

This paper contains 18 sections, 2 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: We apply differentiable simulation to learn (A) walking, (B) vision-based local navigation, and (C) walking around obstacles. These tasks involve ground contact and depth rendering (depicted in red). Analytical first-order policy gradients train A and C in minutes and B in seconds. Links: https://m.youtube.com/watch?v=NcmkAH_nwvw, https://github.com/google-deepmind/mujoco/blob/main/mjx/training_apg.ipynb
  • Figure 2: Policy architecture for pixel-to-joint-angle perceptive navigation. To guide learning the final policy, we use an anchor policy with frozen weights (top) that is trained to nominally trot in place, with MLP layers of dimensions [256, 128]. Its output anchor actions are both observed by the learned policy and added to the learned policy's outputs, similar to RPL [Silver et al., 2019; Johannink et al., 2019]. We pre-process the anchor actions alongside proprioceptive inputs with a dedicated MLP of dimensions [128, 64] before concatenating the flattened 16x12 depth camera input and applying a 128-neuron dense layer.
  • Figure 3: The Pinnochio Trick. Green denotes forward simulation, while yellow indicates gradient flow from the collision event. A: BPTT policy gradient calculation upon collision: the reward signal propagates back from the current state (highlighted in yellow) to actions up to $H$ time steps prior, allowing maneuvers to be learned within this length-$H$ window. Gradients increase in magnitude due to repeated multiplication with the simulation Jacobian. B: To extend the maneuver window without enlarging Jacobian sizes, we penalize collisions with a virtual nose rigidly fixed to the robot's center of mass.
  • Figure 4: Learning blind locomotion. Residual Policy Learning increases asymptotic rewards for SHAC and PPO by 60% and 25% respectively while also improving the latter's sample efficiency by 6x. We plot PPO on a re-scaled y-axis for visualisation.
  • Figure 5: Obstacle avoidance on a point mass robot. Training the local navigation problem (described in the section on local navigation details) using SHAC, we test training convergence when adding differentiable rendering, removing the Pinnochio Trick, or learning on a full-scale vision setup with roughly 1000x more policy parameters. We also train using vanilla truncated BPTT [Song et al., 2024], which converges similarly to SHAC within 2e5 samples, in under two seconds of wall-clock time. PPO struggles to converge compared to the FoPG methods. Note: we run PPO for 1.5e8 steps and downscale its x-axis by 820 for visualization.
  • ...and 2 more figures
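
The Figure 3 caption notes that BPTT gradients grow in magnitude through repeated multiplication with the simulation Jacobian. The sketch below illustrates that effect on a scalar toy system. It assumes hypothetical linear dynamics s_{t+1} = a*s_t + b*u_t (not the quadruped simulator), and the function names are illustrative. Under that assumption the sensitivity of the final state to an action taken k steps earlier is a^(k-1)*b, so it grows geometrically with k whenever |a| > 1.

```python
from typing import List, Sequence


def rollout(s0: float, actions: Sequence[float], a: float, b: float) -> float:
    """Forward-simulate the toy dynamics s_{t+1} = a * s_t + b * u_t."""
    s = s0
    for u in actions:
        s = a * s + b * u
    return s


def action_sensitivities(horizon: int, a: float, b: float) -> List[float]:
    """Return d s_H / d u_t for t = 0..H-1 by reverse accumulation (BPTT)."""
    grads = [0.0] * horizon
    adjoint = 1.0  # d s_H / d s_H
    for t in reversed(range(horizon)):
        grads[t] = adjoint * b  # local derivative d s_{t+1} / d u_t = b
        adjoint *= a            # multiply by the Jacobian d s_{t+1} / d s_t = a
    return grads


if __name__ == "__main__":
    # With a = 2, each extra step back in time doubles the gradient.
    print(action_sensitivities(4, 2.0, 1.0))  # [8.0, 4.0, 2.0, 1.0]
```

Because the earliest action in the window receives the largest gradient, lengthening the horizon H inflates gradient magnitudes, which is the cost that the virtual nose in panel B is meant to avoid.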