PhysBrain: Human Egocentric Data as a Bridge from Vision Language Models to Physical Intelligence

Xiaopeng Lin, Shijie Lian, Bin Yu, Ruoqi Yang, Changti Wu, Yuzhuo Miao, Yurun Jin, Yukun Shi, Cong Huang, Bojun Cheng, Kai Chen

TL;DR

The paper tackles robotic generalization by addressing the viewpoint gap between Vision-Language Models (VLMs), which are trained mostly on third-person data, and humanoid robots, which perceive and act egocentrically. It introduces the Egocentric2Embodiment translation pipeline, which converts large-scale human egocentric videos into structured, schema-driven VQA supervision and yields the E2E-3M dataset. Trained on this data, the PhysBrain embodied brain shows improved egocentric understanding and planning (notably on EgoThink) and provides a strong initialization for vision-language-action (VLA) fine-tuning, achieving a 53.9% SimplerEnv success rate. The work demonstrates that human egocentric supervision can bridge vision-language understanding and physical intelligence, offering a scalable foundation for first-person embodied AI that robot data can complement.

Abstract

Robotic generalization relies on physical intelligence: the ability to reason about state changes, contact-rich interactions, and long-horizon planning under egocentric perception and action. However, most VLMs are trained primarily on third-person data, creating a fundamental viewpoint mismatch for humanoid robots. Scaling robot egocentric data collection remains impractical due to high cost and limited diversity, whereas large-scale human egocentric videos offer a scalable alternative that naturally captures rich interaction context and causal structure. The key challenge is to convert raw egocentric videos into structured and reliable embodiment training supervision. Accordingly, we propose an Egocentric2Embodiment translation pipeline that transforms first-person videos into multi-level, schema-driven VQA supervision with enforced evidence grounding and temporal consistency, enabling the construction of the Egocentric2Embodiment dataset (E2E-3M) at scale. An egocentric-aware embodied brain, termed PhysBrain, is obtained by training on the E2E-3M dataset. PhysBrain exhibits substantially improved egocentric understanding, particularly for planning on EgoThink. It provides an egocentric-aware initialization that enables more sample-efficient VLA fine-tuning and higher SimplerEnv success rates (53.9%), demonstrating effective transfer from human egocentric supervision to downstream robot control.
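
To make "schema-driven VQA supervision with enforced evidence grounding and temporal consistency" concrete, the following is a minimal Python sketch of what such a check could look like. The abstract does not specify the schema, so the record layout (`VQARecord`, `level`, `evidence_span`) and both validation rules are assumptions chosen for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class VQARecord:
    """Hypothetical record layout; field names are illustrative, not from the paper."""
    question: str
    answer: str
    level: str                      # assumed granularity tag, e.g. "step" or "task"
    evidence_span: Tuple[int, int]  # (start_frame, end_frame) supporting the answer


def validate(records: List[VQARecord], num_frames: int) -> List[str]:
    """Return a list of problems; an empty list means the clip's records pass."""
    problems = []
    prev_start = -1
    for i, rec in enumerate(records):
        start, end = rec.evidence_span
        # Evidence grounding (assumed rule): the span must lie inside the clip.
        if not (0 <= start <= end < num_frames):
            problems.append(f"record {i}: evidence span {rec.evidence_span} outside clip")
        # Temporal consistency (assumed rule): records follow the video's order.
        if start < prev_start:
            problems.append(f"record {i}: starts before an earlier record")
        prev_start = max(prev_start, start)
        # Schema check: both text fields must be non-empty.
        if not rec.question.strip() or not rec.answer.strip():
            problems.append(f"record {i}: empty question or answer")
    return problems
```

In this sketch a clip's records are kept only if `validate` returns an empty list; a real pipeline would presumably apply richer, model-assisted checks.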

Paper Structure

This paper contains 27 sections, 9 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Human egocentric supervision improves first-person embodied brains and transfers to control. Left: EgoThink radar plot comparing egocentric VLM performance across six dimensions (Activity, Forecast, Localization, Object, Planning, Reasoning) for representative baselines. Right Top: "Phys" denotes a VLM that underwent supervised fine-tuning on our annotated first-person (egocentric) data (described in the Egocentric2Embodiment section). With this fine-tuning, both VST-7B and Qwen2.5-VL-7B achieve significantly better EgoThink performance, with the most pronounced gains on Planning. Right Bottom: when used as the VLM backbone in a standard VLA fine-tuning pipeline, the same Phys-enhanced backbones yield substantially higher SimplerEnv success rates, indicating that better egocentric planning and interaction reasoning translate to improved downstream manipulation.
  • Figure 2: Illustration of the Egocentric2Embodiment Translation Pipeline.
  • Figure 3: Overview and Data Distribution Statistics of E2E-3M dataset.
  • Figure 4: VLA architecture built on PhysBrain. Given an egocentric observation sequence and a language instruction, PhysBrain encodes multimodal context for action generation. (a) PhysGR00T conditions a flow-matching diffusion action expert on the last-layer hidden states of PhysBrain. (b) PhysPI more tightly couples PhysBrain and the action expert by injecting multiple VLM layers via layer-wise cross-attention.
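
The difference between the two conditioning schemes in Figure 4 can be illustrated with a toy Python sketch. Everything here is an assumption made for illustration: the function names are hypothetical, attention has no learned projections, the use of attention for the last-layer case and the residual update in the layer-wise case are guesses, and the flow-matching action expert itself is not modeled. Only the contrast "last layer only" versus "multiple layers, one cross-attention per layer" comes from the caption.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query: Vector, context: List[Vector]) -> Vector:
    """Single-head scaled dot-product attention of one query over context tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, tok)) / scale for tok in context]
    weights = softmax(scores)
    return [sum(w * tok[d] for w, tok in zip(weights, context))
            for d in range(len(query))]


def condition_last_layer(action_tok: Vector, vlm_layers: List[List[Vector]]) -> Vector:
    """PhysGR00T-like: use only the last VLM layer's hidden states."""
    return cross_attend(action_tok, vlm_layers[-1])


def condition_layerwise(action_tok: Vector, vlm_layers: List[List[Vector]]) -> Vector:
    """PhysPI-like: attend to each VLM layer in turn (residual update is assumed)."""
    h = list(action_tok)
    for layer in vlm_layers:
        ctx = cross_attend(h, layer)
        h = [a + b for a, b in zip(h, ctx)]
    return h
```

Here `vlm_layers` is a list of layers, each a list of token vectors with the same width as the action token. The layer-wise variant exposes the action side to intermediate features that the last-layer variant never sees, which is the tighter coupling the caption describes.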