Latent Action Priors for Locomotion with Deep Reinforcement Learning

Oliver Hausdörfer, Alexander von Rohr, Éric Lefort, Angela Schoellig

TL;DR

The paper addresses the brittleness of deep reinforcement learning for locomotion under direct torque control by introducing latent action priors learned from a small set of expert demonstrations. A nonlinear autoencoder compresses expert actions into a low-dimensional latent space $a_l$, which is fixed during PPO-based DRL and used to generate decoded actions with a residual full-action component; an imitation-style reward further guides learning. Empirical results across diverse robots and tasks show substantial gains in sample efficiency and final performance, with latent priors facilitating transfer and even enabling gait transitions, especially when paired with style rewards. The work highlights the practical utility of data-efficient, action-space priors for torque-controlled locomotion and suggests broad applicability to imitation learning and multi-gait scenarios, including potential extensions to video-derived demonstrations.
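The action composition described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the dimensions, the random stand-in decoder weights, the `tanh` nonlinearity, and the purely additive combination of decoded and residual actions are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

ACTION_DIM = 12  # hypothetical: number of torque-controlled joints
LATENT_DIM = 4   # hypothetical: size of the latent action space

# Stand-in for a decoder trained on expert actions. In the paper the decoder
# is learned from demonstrations and then frozen; here the weights are random.
W1 = rng.normal(scale=0.5, size=(LATENT_DIM, 16))
b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, ACTION_DIM))
b2 = np.zeros(ACTION_DIM)


def decode(a_latent):
    """Map a latent action to a full-dimensional action (fixed nonlinear decoder)."""
    hidden = np.tanh(a_latent @ W1 + b1)
    return hidden @ W2 + b2


def compose_action(policy_output):
    """Split the policy output into latent and residual parts and combine them.

    Assumption: the applied action is the sum of the decoded latent action
    and the full-dimensional residual.
    """
    a_latent = policy_output[:LATENT_DIM]
    a_full = policy_output[LATENT_DIM:]
    a_hat = decode(a_latent)
    return a_hat + a_full


# The policy outputs LATENT_DIM + ACTION_DIM values per step.
policy_output = rng.normal(size=LATENT_DIM + ACTION_DIM)
torques = compose_action(policy_output)
```

Because the decoder is never updated during reinforcement learning, the policy only learns which latent actions and residuals to emit; the mapping from latent to full actions stays tied to the expert data.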

Abstract

Deep Reinforcement Learning (DRL) enables robots to learn complex behaviors through interaction with the environment. However, due to the unrestricted nature of the learning algorithms, the resulting solutions are often brittle and appear unnatural. This is especially true for learning direct joint-level torque control, as inductive biases are difficult to integrate into the learning process. We propose an inductive bias for learning locomotion that is especially useful for torque control: latent actions learned from a small dataset of expert demonstrations. This prior allows the policy to directly leverage knowledge contained in the expert's actions and facilitates more efficient exploration. We observe that the agent is not restricted to the reward levels of the demonstration, and performance in transfer tasks is improved significantly. Latent action priors combined with style rewards for imitation lead to a closer replication of the expert's behavior. Videos and code are available at https://sites.google.com/view/latent-action-priors.

Paper Structure (6 sections, 4 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: We propose a latent action space representation for Deep Reinforcement Learning (DRL) that is especially useful for learning locomotion in direct torque control. Based on a small dataset of expert demonstrations, we learn a representation of the expert's actions that is subsequently used as an action prior in DRL. The decoded latent actions $\hat{a}$ and a residual of the full action space $a_{\mathrm{full}}$ are applied to the robot. During DRL training the latent action decoder is fixed. We combine our approach with an imitation style reward from [Peng.2018] based on the same expert data.
  • Figure 2: We evaluate our method on three Gymnasium MuJoCo [towers2024gymnasium] benchmarks (a)-(c), as well as simulated robot models of the Unitree A1 quadruped (d) and Unitree H1 humanoid (e) from the loco-mujoco benchmark [AlHafez.2023]. For the Gymnasium environments, the task is to maximize forward velocity. For the loco-mujoco environments, the task is to follow the expert's velocity. (f) Additionally, we introduce a new complex environment where two Unitree A1s jointly solve the task of moving a structure to a target position.
  • Figure 3: Principal components (PCs) of the torque actions in the expert demonstrations for the environments. PCs left of the vertical red line explain $>97\%$ of the variance, and the black lines show the cumulative explained variance.
  • Figure 4: Training results for PPO, PPO with style reward (PPO+style), latent action priors (PPO+latent), residual RL (PPO+resRL+style), and reference torque (PPO+refTorque+style) for Unitree A1. All policies are trained in direct torque control. Behavioral cloning did not perform better than PPO, so its results are omitted from this figure for readability. Reported are the task rewards. We run all experiments on five seeds and report the means and standard deviations.
  • Figure 5: Task rewards and standard deviations (error bars) for Unitree A1 in transfer tasks. The style rewards and action priors are the same for all experiments. For the transfer tasks, the Unitree A1 has to move faster than seen in the expert demonstration and in any target direction.
  • ...and 5 more figures
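
The kind of analysis shown in Figure 3 can be sketched as follows: run PCA on a matrix of expert actions and count how many principal components are needed before the cumulative explained variance exceeds 97%. This is a rough sketch on synthetic data. The function name `num_components_for_variance`, the data dimensions, and the noise level are assumptions for illustration and are not taken from the paper.

```python
import numpy as np


def num_components_for_variance(actions, threshold=0.97):
    """Return the number of PCs whose cumulative explained variance exceeds
    `threshold`, together with the cumulative explained-variance curve.

    `actions` has shape (num_samples, action_dim).
    """
    centered = actions - actions.mean(axis=0)
    # Squared singular values of the centered data are proportional to the
    # variance captured by each principal component.
    singular_values = np.linalg.svd(centered, compute_uv=False)
    explained = singular_values**2 / np.sum(singular_values**2)
    cumulative = np.cumsum(explained)
    k = int(np.argmax(cumulative > threshold)) + 1
    return k, cumulative


# Synthetic stand-in for expert torque data: 8-dimensional actions driven by
# 3 underlying factors plus a small amount of noise (hypothetical values).
rng = np.random.default_rng(0)
factors = rng.normal(size=(1000, 3))
mixing = rng.normal(size=(3, 8))
actions = factors @ mixing + 0.01 * rng.normal(size=(1000, 8))

k, cumulative = num_components_for_variance(actions)
```

If only a few components are needed, as in this synthetic example, the actions lie close to a low-dimensional subspace, which is the observation that motivates a low-dimensional latent action space.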