BFM-Zero: A Promptable Behavioral Foundation Model for Humanoid Control Using Unsupervised Reinforcement Learning

Yitang Li, Zhengyi Luo, Tonghe Zhang, Cunxi Dai, Anssi Kanervisto, Andrea Tirinzoni, Haoyang Weng, Kris Kitani, Mateusz Guzek, Ahmed Touati, Alessandro Lazaric, Matteo Pirotta, Guanya Shi

TL;DR

BFM-Zero tackles the challenge of learning a single, promptable policy for humanoid whole-body control by embedding motions, goals, and rewards into a unified latent space $Z \subseteq \mathbb{R}^d$ learned via unsupervised reinforcement learning with Forward-Backward representations. The method combines offline motion data, online interaction, domain randomization, and asymmetric history-based training to produce a latent-conditioned policy $\pi_z$ capable of zero-shot task execution (tracking, pose reaching, reward optimization) and fast adaptation without retraining. Zero-shot performance is demonstrated on a real Unitree G1 with robust disturbance rejection and natural recovery, and a few-shot adaptation paradigm (CEM-based pose optimization, dual-annealing trajectory optimization) further improves performance under dynamics shifts. The work advances scalable, explainable, promptable behavioral foundation models for real-world humanoids and opens paths toward high-level planning and composition of skills.
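As a rough illustration of the few-shot adaptation mentioned above, the cross-entropy method (CEM) can search the latent space directly for a prompt $z$ that lowers a task cost, leaving the policy weights untouched. The sketch below is a generic CEM loop, not the authors' implementation: `cem_optimize`, `toy_cost`, `TARGET`, and all hyperparameters are hypothetical, and on a robot the cost would come from rollouts of $\pi_z$ rather than a closed-form function.

```python
import math
import random


def cem_optimize(cost, z_init, iterations=20, population=64,
                 elite_frac=0.125, init_std=0.5, seed=0):
    """Generic cross-entropy method over a latent vector z (illustrative only)."""
    rng = random.Random(seed)
    d = len(z_init)
    mean = list(z_init)
    std = [init_std] * d
    n_elite = max(1, int(population * elite_frac))
    best_z, best_cost = list(z_init), cost(z_init)

    for _ in range(iterations):
        # Sample candidate prompts from the current diagonal Gaussian.
        samples = [[rng.gauss(mean[j], std[j]) for j in range(d)]
                   for _ in range(population)]
        scored = sorted(((cost(z), z) for z in samples), key=lambda p: p[0])
        if scored[0][0] < best_cost:
            best_cost, best_z = scored[0][0], scored[0][1]

        # Refit mean and standard deviation to the lowest-cost samples.
        elites = [z for _, z in scored[:n_elite]]
        mean = [sum(z[j] for z in elites) / n_elite for j in range(d)]
        std = [math.sqrt(sum((z[j] - mean[j]) ** 2 for z in elites) / n_elite) + 1e-6
               for j in range(d)]
    return best_z, best_cost


# Hypothetical stand-in for a rollout-based cost: squared distance to a hidden latent.
TARGET = [0.5, -0.3, 0.8, 0.1]


def toy_cost(z):
    return sum((a - b) ** 2 for a, b in zip(z, TARGET))


best_z, best_cost = cem_optimize(toy_cost, [0.0] * len(TARGET))
```

Because only $z$ is optimized, each candidate costs one evaluation of the frozen policy, which is what makes this kind of adaptation cheap compared with retraining.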

Abstract

Building Behavioral Foundation Models (BFMs) for humanoid robots has the potential to unify diverse control tasks under a single, promptable generalist policy. However, existing approaches are either exclusively deployed on simulated humanoid characters, or specialized to specific tasks such as tracking. We propose BFM-Zero, a framework that learns an effective shared latent representation that embeds motions, goals, and rewards into a common space, enabling a single policy to be prompted for multiple downstream tasks without retraining. This well-structured latent space in BFM-Zero enables versatile and robust whole-body skills on a Unitree G1 humanoid in the real world, via diverse inference methods, including zero-shot motion tracking, goal reaching, and reward optimization, and few-shot optimization-based adaptation. Unlike prior on-policy reinforcement learning (RL) frameworks, BFM-Zero builds upon recent advancements in unsupervised RL and Forward-Backward (FB) models, which offer an objective-centric, explainable, and smooth latent representation of whole-body motions. We further extend BFM-Zero with critical reward shaping, domain randomization, and history-dependent asymmetric learning to bridge the sim-to-real gap. Those key design choices are quantitatively ablated in simulation. A first-of-its-kind model, BFM-Zero establishes a step toward scalable, promptable behavioral foundation models for whole-body humanoid control.
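To make "prompting" concrete: in the general Forward-Backward literature, a task is turned into a latent $z$ by passing states through the backward map $B$. The sketch below follows those commonly used formulas (goal: $z = B(s_g)$; reward: $z \approx \mathbb{E}[r(s)\,B(s)]$; tracking: a sum of $B$ over a short future window), which this page does not spell out, so treat them as assumptions. The names `toy_backward`, `project_to_sphere`, and the window length are hypothetical, and the random linear map merely stands in for a trained network.

```python
import math
import random


def project_to_sphere(z):
    """Rescale z to norm sqrt(d), a common FB normalization (assumed here)."""
    norm = math.sqrt(sum(v * v for v in z))
    if norm == 0.0:
        return list(z)
    scale = math.sqrt(len(z)) / norm
    return [v * scale for v in z]


def goal_prompt(backward, goal_state):
    """Goal reaching: embed the single target state."""
    return project_to_sphere(backward(goal_state))


def reward_prompt(backward, states, rewards):
    """Reward optimization: reward-weighted mean of B(s) over sampled states."""
    d = len(backward(states[0]))
    acc = [0.0] * d
    for s, r in zip(states, rewards):
        b = backward(s)
        for j in range(d):
            acc[j] += r * b[j]
    return project_to_sphere([a / len(states) for a in acc])


def tracking_prompts(backward, motion, horizon):
    """Tracking: at step t, sum B over the next `horizon` reference states."""
    embs = [backward(s) for s in motion]
    d = len(embs[0])
    prompts = []
    for t in range(len(motion) - 1):
        window = embs[t + 1 : t + 1 + horizon]  # clipped near the end of the motion
        prompts.append(project_to_sphere([sum(e[j] for e in window) for j in range(d)]))
    return prompts


# Hypothetical stand-in for a trained backward network: a fixed random linear map.
random.seed(0)
STATE_DIM, LATENT_DIM = 3, 4
W = [[random.gauss(0, 1) for _ in range(STATE_DIM)] for _ in range(LATENT_DIM)]


def toy_backward(state):
    return [sum(w * x for w, x in zip(row, state)) for row in W]
```

All three functions return vectors in the same space, so the same latent-conditioned policy can consume any of them without retraining.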

Paper Structure

This paper contains 24 sections, 10 equations, 15 figures, 3 tables, 1 algorithm.

Figures (15)

  • Figure 1: BFM-Zero enables versatile and robust whole-body skills. (A-C) Diverse zero-shot inference methods. (D) Natural recovery from large perturbation. (E) Few-shot adaptation.
  • Figure 2: An overview of the BFM-Zero framework. After the pre-training stage, BFM-Zero forms a latent space that can be used for zero-shot reward optimization, single-frame goal reaching, and tracking. It can also be adapted in a few-shot fashion to reach more challenging poses.
  • Figure 3: Tracking, reward, and goal-reaching performance across models for different testing configurations (left), and example distributions of reward evaluation scores for BFM-Zero in Isaac (DR) (right). Each metric is averaged over tasks. For reward, we report the average return over episodes lasting 500 steps; for tracking, the joint position error $E_{\mathrm{mpjpe}}$ averaged over the whole motion; and for goal-reaching, the error $E_{\mathrm{mpjpe}}$ averaged over the episode.
  • Figure 4: Real-World Validation of Tracking. Left: Highly dynamic dancing. Middle: Frequent turning while walking. Right: Natural recovery to continue tracking the motion.
  • Figure 5: Real-World Validation of Goal Reaching. (a) Continuous goal reaching: the blue/yellow pose denotes the goal pose, black marks the real robot pose, and gray visualizes the transition between poses. (b) Transition from any pose to T-pose.
  • ...and 10 more figures
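
The error $E_{\mathrm{mpjpe}}$ in the Figure 3 caption is commonly expanded as mean per-joint position error: the Euclidean distance between each achieved joint position and its reference, averaged over joints and frames. A minimal sketch under that assumption follows; the function name `mpjpe` and the data layout are hypothetical, and the paper's exact variant (global versus root-relative positions, units) is not stated on this page.

```python
import math


def mpjpe(pred, ref):
    """Mean Euclidean distance between matching joints, over all frames.

    pred, ref: lists of frames; each frame is a list of (x, y, z) joint positions.
    """
    total, count = 0.0, 0
    for pred_frame, ref_frame in zip(pred, ref):
        for p, r in zip(pred_frame, ref_frame):
            total += math.dist(p, r)
            count += 1
    if count == 0:
        raise ValueError("need at least one joint position")
    return total / count


# Two frames, two joints; one joint is off by a 3-4-5 triangle scaled to 0.5.
reference = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
             [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
achieved = [[(0.3, 0.4, 0.0), (1.0, 0.0, 0.0)],
            [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]]
error = mpjpe(achieved, reference)
```

Averaging over the whole motion (tracking) or over the episode (goal-reaching) then only changes which frames are passed in.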