DecAP: Decaying Action Priors for Accelerated Imitation Learning of Torque-Based Legged Locomotion Policies

Shivam Sood, Ge Sun, Peizhuo Li, Guillaume Sartoretti

TL;DR

The paper tackles the sample inefficiency of torque-space learning for legged locomotion by proposing DecAP, a two-stage framework that first trains a position-based policy to generate imitation data and then adds decaying torque priors to bootstrap torque exploration. It formalizes the problem as an MDP for velocity tracking, uses imitation rewards and shaping terms, and scales a PD-based torque bias by a time decay factor $\gamma^{t/k}$ ($\gamma = 0.99$, $k = 100$) so that the bias guides early exploration and then fades out. The approach is validated in simulation on three legged robots (Agility Cassie, Unitree Go1, and Hebi Daisy) and in hardware on a Unitree Go1, reaching high-quality gaits in about 25 minutes of wall-clock time and remaining robust to imitation reward scaling from $0.1\times$ to $10\times$. DecAP with imitation consistently outperforms imitation alone, and the position-assisted torque-based policy is compared against a position-based policy for robustness under disturbed or out-of-distribution conditions. The results suggest that torque-based control can be learned end-to-end more efficiently by leveraging position-space data and controlled exploration, enabling robust real-world locomotion without domain randomization in the form of external disturbances during training, and carrying over across platforms.

Abstract

Optimal Control for legged robots has gone through a paradigm shift from position-based to torque-based control, owing to the latter's compliant and robust nature. In parallel to this shift, the community has also turned to Deep Reinforcement Learning (DRL) as a promising approach to directly learn locomotion policies for complex real-life tasks. However, most end-to-end DRL approaches still operate in position space, mainly because learning in torque space is often sample-inefficient and does not consistently converge to natural gaits. To address these challenges, we propose a two-stage framework. In the first stage, we generate our own imitation data by training a position-based policy, eliminating the need for expert knowledge to design optimal controllers. The second stage incorporates decaying action priors, a novel method to enhance the exploration of torque-based policies aided by imitation rewards. We show that our approach consistently outperforms imitation learning alone and is robust to scaling these rewards from 0.1x to 10x. We further validate the benefits of torque control by comparing the robustness of a position-based policy to a position-assisted torque-based policy on a quadruped (Unitree Go1) without any domain randomization in the form of external disturbances during training.

Paper Structure (19 sections, 4 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: (Top) Our position-assisted, torque-based policy successfully helps the robot navigate real-life uneven terrain, despite having been trained on flat ground without any external force disturbances. (Bottom) Our torque-based velocity tracking policies for Agility-Cassie (left), Unitree-Go1 (middle), and Hebi-Daisy (right) achieve high-quality gaits within just 25 minutes of wall-clock time.
  • Figure 2: Overview of the proposed torque learning framework. First, we train a position-based policy $\pi_q$ to acquire offline position imitation data ($\hat{x}_t$) for the robot state ($x_t$), which is incorporated into the reward structure while training the torque-based policy. At the same time, we augment the sampled actions ($\mathcal{A}'$) with a torque bias, $\mathrm{PID}(\hat{q}_t - q_t)$. This torque bias, calculated from the joint-position imitation data, guides the initial actions for faster convergence and is multiplied by a gradual time decay factor $\gamma^{t/k}$. Finally, once the torque bias becomes negligible and the torque-based policy's actions alone are sufficient to operate the robot, we deploy these torques along with a low-gain PD controller to send the final torques ($\tau_t$) to the robot actuators.
  • Figure 3: The position-based policy generates different action outputs depending on the PID gains, while the joint angles tracked by the robot in simulation remain relatively stable, which makes the tracked angles better suited for imitation.
  • Figure 4: RMSE (in radians) between the simulated robot's tracked angles and the reference imitation angles at different reward weights. DecAP + Imitation consistently outperforms imitation alone when learning in torque space. Relying on position imitation data alone yields unnatural gaits, evident from the large deviations from the reference imitation angles.
  • Figure 5: Reward progression over time for various legged robots. The vanishing of DecAP indicates that the action priors have become insignificant, which suggests that the actions sampled directly from the policy are by then adequate for controlling the robot.
  • ...and 2 more figures
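
The decaying torque bias described in the Figure 2 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the PD form of the bias (proportional term on $\hat{q}_t - q_t$ plus velocity damping), the gains `kp` and `kd`, and the reading of $t$ as a training step counter are all hypothetical. Only the decay factor $\gamma^{t/k}$ and the values $\gamma = 0.99$, $k = 100$ come from the text above.

```python
import numpy as np


def decay_factor(t, gamma=0.99, k=100.0):
    """Time decay gamma ** (t / k): equals 1 at t = 0 and shrinks toward 0."""
    return gamma ** (t / k)


def pd_torque_bias(q_ref, q, q_dot, kp=20.0, kd=0.5):
    """PD torque pulling the joints toward the reference imitation angles.

    kp and kd are hypothetical gains chosen for illustration only.
    """
    q_ref, q, q_dot = (np.asarray(a, dtype=float) for a in (q_ref, q, q_dot))
    return kp * (q_ref - q) - kd * q_dot


def biased_action(policy_torque, q_ref, q, q_dot, t, gamma=0.99, k=100.0):
    """Torque sampled from the policy plus the decayed torque bias."""
    bias = pd_torque_bias(q_ref, q, q_dot)
    return np.asarray(policy_torque, dtype=float) + decay_factor(t, gamma, k) * bias


if __name__ == "__main__":
    # Toy two-joint example: the policy outputs zero torque, so the printed
    # action is exactly the decayed bias at each value of t.
    q_ref, q, q_dot = [0.10, -0.20], [0.0, 0.0], [0.0, 0.0]
    for t in (0, 10_000, 100_000):
        print(t, decay_factor(t), biased_action([0.0, 0.0], q_ref, q, q_dot, t))
```

With these values the bias is applied in full at $t = 0$, is scaled by roughly 0.37 at $t = 10{,}000$ (since $0.99^{100} \approx 0.366$), and is negligible by $t = 100{,}000$, at which point the action is effectively the policy's own torque.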