Universal Humanoid Motion Representations for Physics-Based Control

Zhengyi Luo, Jinkun Cao, Josh Merel, Alexander Winkler, Jing Huang, Kris Kitani, Weipeng Xu

TL;DR

This paper introduces PULSE, a universal, physics-grounded motion representation for humanoid control. It achieves broad motion coverage by distilling a large-scale motion imitator into a probabilistic latent space using a variational information bottleneck and a proprioception-conditioned prior, enabling long-horizon, realistic motion. A learnable prior and residual action formulation facilitate efficient hierarchical RL, allowing the latent space to drive diverse generative and motion-tracking tasks with faster training and human-like behavior. Experiments on AMASS-derived data and VR-tracking scenarios demonstrate improved imitation quality, generation diversity, and sample efficiency compared to state-of-the-art latent-space methods, while maintaining physical plausibility. This work positions PULSE as a foundation model for control, enabling scalable reuse of rich motor skills across a wide range of humanoid tasks.

Abstract

We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control. Due to the high dimensionality of humanoids and the inherent difficulties in reinforcement learning, prior methods have focused on learning skill embeddings for a narrow range of movement styles (e.g. locomotion, game characters) from specialized motion datasets. This limited scope hampers their applicability in complex tasks. We close this gap by significantly increasing the coverage of our motion representation space. To achieve this, we first learn a motion imitator that can imitate all of human motion from a large, unstructured motion dataset. We then create our motion representation by distilling skills directly from the imitator. This is achieved by using an encoder-decoder structure with a variational information bottleneck. Additionally, we jointly learn a prior conditioned on proprioception (humanoid's own pose and velocities) to improve model expressiveness and sampling efficiency for downstream tasks. By sampling from the prior, we can generate long, stable, and diverse human motions. Using this latent space for hierarchical RL, we show that our policies solve tasks using human-like behavior. We demonstrate the effectiveness of our motion representation by solving generative tasks (e.g. strike, terrain traversal) and motion tracking using VR controllers.
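
The distillation step described in the abstract can be illustrated with a minimal sketch. It assumes the encoder posterior and the proprioception-conditioned prior are both diagonal Gaussians, that the student is trained to match the imitator's actions with a squared error, and that a weight `beta` balances the KL term; these choices, along with every function and parameter name below, are hypothetical illustrations and are not taken from the paper.

```python
import math
import random


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) between two diagonal Gaussians, summed over latent dimensions.

    q plays the role of the encoder posterior and p the learned prior
    conditioned on proprioception (both assumed Gaussian in this sketch).
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5
    return total


def sample_latent(mu, sigma, rng):
    """Reparameterized sample z = mu + sigma * eps with eps ~ N(0, 1)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


def distillation_loss(student_action, teacher_action,
                      mu_q, sigma_q, mu_p, sigma_p, beta=0.01):
    """Action-matching error plus a beta-weighted KL bottleneck term.

    The squared-error form and the default beta are assumptions made
    for illustration only.
    """
    recon = sum((a - b) ** 2 for a, b in zip(student_action, teacher_action))
    return recon + beta * kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p)


if __name__ == "__main__":
    rng = random.Random(0)
    # Toy 2-D latent: posterior from a hypothetical encoder, prior from a
    # hypothetical proprioception-conditioned network.
    mu_q, sigma_q = [0.2, -0.1], [0.5, 0.8]
    mu_p, sigma_p = [0.0, 0.0], [1.0, 1.0]
    z = sample_latent(mu_q, sigma_q, rng)
    loss = distillation_loss([0.1, 0.3], [0.0, 0.25], mu_q, sigma_q, mu_p, sigma_p)
    print("sampled latent:", z)
    print("loss:", loss)
```

Under these assumptions, the KL term pulls the posterior toward the prior, which is what would make sampling latents from the prior alone a plausible source of motor skills for a downstream task policy.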

Paper Structure (45 sections, 5 equations, 6 figures, 6 tables, 1 algorithm)

This paper contains 45 sections, 5 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: We propose to learn a motion representation that can be reused universally by downstream tasks. From left to right: speed, strike target, complex terrain traversal, and VR controller tracking.
  • Figure 2: We form our latent space by directly distilling from a pretrained motion imitator that can imitate all of the motion sequences from a large-scale dataset. A variational information bottleneck is used to model the distribution of motor skills conditioned on proprioception. After training the latent space model, the decoder $\boldsymbol{\mathcal{D}}$ and prior $\boldsymbol{\mathcal{R}}$ are frozen and used for downstream tasks.
  • Figure 3: (a, b, c, d) Policies trained using our motion representation solve tasks with human-like behavior. (e) Our latent space is not constrained to certain movement styles and supports free-form tracking. (f) Random sampling from the learned prior $\boldsymbol{\mathcal{R}}$ leads to human-like movements as well as recovery from a fallen state.
  • Figure 4: Training curves for each of the generative tasks. Using our motion representation improves training speed and performance, as the task policy can explore in an informative latent space. Each experiment is run 3 times with different random seeds.
  • Figure 5: Success rate during training, comparing training from scratch with training using our motion representation. Ours converges faster.
  • ...and 1 more figures