$Ψ_0$: An Open Foundation Model Towards Universal Humanoid Loco-Manipulation

Songlin Wei; Hongyi Jing; Boqian Li; Zhenyu Zhao; Jiageng Mao; Zhenhao Ni; Sicheng He; Jie Liu; Xiawei Liu; Kaidi Kang; Sheng Zang; Weiduo Yuan; Marco Pavone; Di Huang; Yue Wang

Abstract

We introduce $Ψ_0$ (Psi-Zero), an open foundation model for challenging humanoid loco-manipulation tasks. Existing approaches often address this problem by co-training on large and diverse human and humanoid data; we argue that this strategy is suboptimal because of the fundamental kinematic and motion disparities between humans and humanoid robots. As a result, data efficiency and model performance remain unsatisfactory despite the considerable data volume. To address this challenge, $Ψ_0$ decouples the learning process to maximize the utility of heterogeneous data sources. Specifically, we propose a staged training paradigm with different learning objectives: first, we autoregressively pre-train a VLM backbone on large-scale egocentric human videos to acquire generalizable visual-action representations; then, we post-train a flow-based action expert on high-quality humanoid robot data to learn precise robot joint control. Our research further identifies a critical yet often overlooked data recipe: in contrast to approaches that scale with noisy Internet clips or heterogeneous cross-embodiment robot datasets, we demonstrate that pre-training on high-quality egocentric human manipulation data followed by post-training on domain-specific real-world humanoid trajectories yields superior performance. Extensive real-world experiments demonstrate that $Ψ_0$ achieves the best performance using only about 800 hours of human video data and 30 hours of real-world robot data, outperforming baselines pre-trained on more than 10$\times$ as much data by over 40% in overall success rate across multiple tasks. We will open-source the entire ecosystem to the community, including a data processing and training pipeline, a humanoid foundation model, and a real-time action inference engine.
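To make the phrase "flow-based action expert" concrete, the sketch below shows how a flow model typically turns noise into an action at inference time: it integrates a velocity field from t=0 to t=1 with forward Euler steps. This is a generic illustration, not the $Ψ_0$ implementation. The names `euler_sample` and `make_toy_velocity` are hypothetical, and the closed-form toy velocity stands in for a learned network that would also condition on vision-language features.

```python
import random


def euler_sample(velocity, x0, num_steps=10):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with forward Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for a learned action expert.

    For the straight-line path x_t = (1 - t) * x0 + t * target, the velocity
    (target - x_t) / (1 - t) equals target - x0, so the flow ends at `target`.
    """
    def velocity(x, t):
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return velocity


random.seed(0)
target_action = [0.3, -0.5, 1.2]  # assumed 3-DoF joint target, for illustration only
noise = [random.gauss(0.0, 1.0) for _ in target_action]
action = euler_sample(make_toy_velocity(target_action), noise, num_steps=10)
```

In a real system the velocity function would be a neural network, and the output would be a whole chunk of joint-space actions rather than a single three-dimensional vector.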

Paper Structure (55 sections, 2 equations, 11 figures, 6 tables)

Introduction
Related Works
Whole-Body Dexterous Manipulation
Humanoid VLAs
Learning From Egocentric Videos
The $\Psi_{0}$ Foundation Model
Model Architecture
Training Recipe
Pre-Training on Egocentric Human Video
Post-Training on Cross-Task Real Humanoid Data
Fine-Tuning on In-domain Teleoperation Data
Real-Time Action Chunking
Tailoring Teleoperation for Loco-Manipulation
Experiments
Implementation
...and 40 more sections

Figures (11)

Figure 1: Humanoid Loco-Manipulation. $\Psi_{0}$ performs diverse loco-manipulation tasks in a pantry, including taking a cup from the coffee machine, pushing a cart, wiping the table, grasping a bottle and placing it in the sink, and pushing the fridge door.
Figure 2: Model Training and Deployment: First, we pre-train the VLM on the EgoDex dataset to autoregressively predict the next-action tokens in the task space. Then, we post-train the flow-based action expert using robotic data to predict action chunks in the joint space. Finally, we implement a real-time chunking mechanism that leverages the lower-body controller to achieve smooth whole-body control.
Figure 3: MM-DiT for VLA: Comparison of the MM-DiT architecture with a naive DiT. $\tau$ is the flow timestep, and VL and A denote the vision-language and action hidden states, respectively.
Figure 4: Real-Time Chunking: Given that the previous action is being executed (yellow line), the next action can diverge significantly (cyan line) without RTC, which leads to control jitter. With RTC (red line), the divergence between two consecutive actions is strongly suppressed, resulting in smoother and more stable behavior.
Figure 5: Real-Robot Teleoperation Setup: We use MANUS gloves for dexterous hand retargeting; a VR headset and wrist trackers capture upper-body poses for inverse kinematics, while waist and foot trackers provide high-level locomotion commands.
...and 6 more figures
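The Figure 4 caption describes how a newly predicted action chunk can diverge from the one currently being executed. The sketch below illustrates that effect with a plain linear cross-fade between the unexecuted tail of the previous chunk and the start of the next one. It is a deliberately simple stand-in and not the paper's real-time chunking mechanism. The function name `crossfade_chunks`, the single-joint scalar actions, and the assumption that `next_chunk[0]` is time-aligned with `prev_chunk[executed]` are all hypothetical choices made for illustration.

```python
def crossfade_chunks(prev_chunk, next_chunk, executed, overlap):
    """Linearly cross-fade the unexecuted tail of prev_chunk into next_chunk.

    prev_chunk, next_chunk: lists of scalar actions (one joint, for simplicity).
    executed: number of steps of prev_chunk that have already been executed.
    overlap: maximum number of steps over which to blend.
    Assumes next_chunk[0] is time-aligned with prev_chunk[executed].
    """
    tail = prev_chunk[executed:executed + overlap]
    n = min(len(tail), len(next_chunk))
    blended = list(next_chunk)
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight on the new chunk rises across the overlap
        blended[i] = (1.0 - w) * tail[i] + w * next_chunk[i]
    return blended


prev_chunk = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
next_chunk = [1.0] * 6  # a new prediction that disagrees with the old plan
executed = 2            # prev_chunk[0] and prev_chunk[1] have already run

blended = crossfade_chunks(prev_chunk, next_chunk, executed, overlap=3)

last_executed = prev_chunk[executed - 1]
jump_without_blend = abs(next_chunk[0] - last_executed)
jump_with_blend = abs(blended[0] - last_executed)
```

With these toy numbers, the command jump at the chunk boundary shrinks from 0.9 to about 0.3, which is the kind of jitter reduction the caption attributes to RTC.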
