Zero-Shot Whole-Body Humanoid Control via Behavioral Foundation Models

Andrea Tirinzoni, Ahmed Touati, Jesse Farebrother, Mateusz Guzek, Anssi Kanervisto, Yingchen Xu, Alessandro Lazaric, Matteo Pirotta

TL;DR

The paper tackles zero-shot, whole-body humanoid control by grounding unsupervised RL with unlabeled motion data through Forward-Backward representations and a latent-conditioned discriminator (FB-CPR). This yields a humanoid behavioral foundation model (Meta Motivo) trained on observation-only AMASS data to perform diverse tasks such as motion tracking, goal reaching, and reward optimization, without task-specific fine-tuning. Key contributions include the FB-CPR algorithm, a principled distribution-matching objective via a latent discriminator, and extensive humanoid experiments showing competitive performance and more human-like behaviors than reward-only baselines. The approach offers practical benefits for scalable, generalizable humanoid control while highlighting avenues for further theoretical understanding and data-driven extension (perception, planning, language alignment).

Abstract

Unsupervised reinforcement learning (RL) aims at pre-training agents that can solve a wide range of downstream tasks in complex environments. Despite recent advancements, existing approaches suffer from several limitations: they may require running an RL process on each downstream task to achieve a satisfactory performance, they may need access to datasets with good coverage or well-curated task-specific samples, or they may pre-train policies with unsupervised losses that are poorly correlated with the downstream tasks of interest. In this paper, we introduce a novel algorithm regularizing unsupervised RL towards imitating trajectories from unlabeled behavior datasets. The key technical novelty of our method, called Forward-Backward Representations with Conditional-Policy Regularization, is to train forward-backward representations to embed the unlabeled trajectories into the same latent space used to represent states, rewards, and policies, and use a latent-conditional discriminator to encourage policies to "cover" the states in the unlabeled behavior dataset. As a result, we can learn policies that are well aligned with the behaviors in the dataset, while retaining zero-shot generalization capabilities for reward-based and imitation tasks. We demonstrate the effectiveness of this new approach in a challenging humanoid control problem: leveraging observation-only motion capture datasets, we train Meta Motivo, the first humanoid behavioral foundation model that can be prompted to solve a variety of whole-body tasks, including motion tracking, goal reaching, and reward optimization. The resulting model is capable of expressing human-like behaviors and it achieves competitive performance with task-specific methods while outperforming state-of-the-art unsupervised RL and model-based baselines.

Paper Structure

This paper contains 58 sections, 25 equations, 22 figures, 28 tables, 1 algorithm.

Figures (22)

  • Figure 1: Meta Motivo is the first behavioral foundation model for humanoid agents that can solve whole-body control tasks such as tracking, pose-reaching, and reward optimization through zero-shot inference. Meta Motivo is trained with a novel unsupervised reinforcement learning algorithm regularizing zero-shot forward-backward policy learning with imitation of unlabeled motions.
  • Figure 2: Illustration of the main components of FB-CPR: the discriminator is trained to estimate the ratio between the latent-state distribution induced by policies $(\pi_z)$ and the unlabeled behavior dataset $\mathcal{M}$, where trajectories are embedded through $\textsc{ER}_{\mathrm{FB}}$. The policies are trained with a regularized loss combining a policy improvement objective based on the FB action value function and a critic trained on the discriminator. Finally, the learned policies are rolled out to collect samples that are stored into the replay buffer $\mathcal{D}_{\mathrm{online}}$.
  • Figure 3: Human evaluation. The left figure reports the percentage of times a behavior solved a reward-based (blue) or a goal-reaching (pink) task (tasks are independently evaluated). The right figure reports the score for human-likeness by direct comparison of the two algorithms.
  • Figure 4: FB-CPR Ablations. (Top Left) Ablating the FB-CPR discriminator's policy conditioning. (Top Right) Ablating the contribution of $F(z)^\top z$ in the FB-CPR actor loss (Eq. \ref{eq:fb.cpr.actor.loss}). (Bottom Left) The effect of increasing model capacity along with the number of motions in the dataset $\mathcal{M}$. (Bottom Right) Contrasting Advantage-Weighted FB (FB-AW) trained from a large diverse offline dataset versus FB-CPR trained fully online with policy regularization. All ablations are averaged over $5$ seeds with ranges representing bootstrapped 95% confidence intervals.
  • Figure 5: Examples of the poses used for goal-based evaluation.
  • ...and 17 more figures
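
The two ingredients named in the Figure 2 caption, embedding an unlabeled trajectory into the latent space and turning a discriminator into a density-ratio estimate, can be illustrated with a small NumPy sketch. This is not the paper's implementation: the backward map here is a hypothetical fixed linear projection standing in for a learned network, rescaling the latent to norm $\sqrt{d}$ is an assumption, and the function names are invented for illustration. The only general fact relied on is that, for a sigmoid discriminator $D$, $\log\frac{D}{1-D}$ equals its logit, which is the usual log-ratio estimate in adversarial distribution matching.

```python
import numpy as np


def backward_embedding(states, W):
    """Hypothetical stand-in for a learned backward map B(s).

    Here B is just a fixed linear projection; in practice it would be a
    trained neural network.
    """
    return states @ W


def embed_trajectory(states, W):
    """Embed a trajectory as the average of B(s) over its states.

    Assumption: the result is rescaled to norm sqrt(d), where d is the
    latent dimension.
    """
    z = backward_embedding(states, W).mean(axis=0)
    d = z.shape[0]
    return np.sqrt(d) * z / np.linalg.norm(z)


def log_ratio_from_logits(logits):
    """Log density-ratio estimate log(D / (1 - D)) for D = sigmoid(logits).

    Algebraically this equals the logit itself; it is computed the long
    way here to make the identity explicit.
    """
    d = 1.0 / (1.0 + np.exp(-logits))
    return np.log(d) - np.log1p(-d)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    state_dim, latent_dim, horizon = 6, 4, 10
    W = rng.normal(size=(state_dim, latent_dim))       # hypothetical B weights
    trajectory = rng.normal(size=(horizon, state_dim))  # synthetic states
    z = embed_trajectory(trajectory, W)
    print("latent shape:", z.shape)
    print("log-ratio:", log_ratio_from_logits(np.array([-1.0, 0.0, 2.0])))
```

In this sketch the latent `z` would condition both the policy and the discriminator, so that the log-ratio acts as a per-latent reward signal; how those pieces are combined in the actual actor loss is specified in the paper, not here.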