
PhysiFlow: Physics-Aware Humanoid Whole-Body VLA via Multi-Brain Latent Flow Matching and Robust Tracking

Weikai Qin, Sichen Wu, Ci Chen, Mengfan Liu, Linxi Feng, Xinru Cui, Haoqi Han, Hesheng Wang

TL;DR

We present a semantic-motion intent guided, physics-aware multi-brain VLA framework for humanoid whole-body control and demonstrate that it enables reliable vision-language-guided full-body coordination for humanoid robots.

Abstract

In humanoid robot control, fusing Vision-Language-Action (VLA) models with whole-body control is essential for semantically guided execution of real-world tasks. However, existing methods suffer from either low VLA inference efficiency or a lack of effective semantic guidance for whole-body control, which leads to instability in dynamic limb-coordinated tasks. To bridge this gap, we present a semantic-motion intent guided, physics-aware multi-brain VLA framework for humanoid whole-body control. We evaluate the framework in a series of experiments, and the results show that it enables reliable vision-language-guided full-body coordination for humanoid robots.

Paper Structure (17 sections, 5 equations, 6 figures, 2 tables)

Figures (6)

  • Figure 1: Introducing PhysiFlow, a multi-brain VLA humanoid system that runs on Unitree G1 robots and performs end-to-end VLA humanoid whole-body control in large spaces. The proposed system completes consecutive tasks autonomously, including (a-c) walking to the designated item, sitting on the item, and raising an arm; (d-f) circling the designated item, standing up from the item, and turning right.
  • Figure 2: The overall pipeline of PhysiFlow. This bio-inspired architecture decouples semantic reasoning from physics-aware execution. (a) Neocortical Brain: A curriculum-based CVAE processes vision and language to synthesize a 10 latent vector $z_{vl}$, aligning task semantics with motion intent. (b) Basal Ganglionic Brain: Conditioned on $z_{vl}$ and robot states, a flow-matching model generates 50 motion sequence $m_t$ for continuity. (c) Cerebellar Brain: A robust motion tracker enforces physical constraints, translating these chunks into stable motor commands for closed-loop whole-body control.
  • Figure 3: Visualization of the VLA dataset. (a) Diverse visuals with various scenes and items. (b) Diverse camera angles with ego and exo views. (c) Diverse tasks, from turning around to standing up.
  • Figure 4: Performance benchmarking of the Basal Ganglionic Brain. The proposed flow-matching (FM) paradigm is evaluated against autoregressive (AR) and Denoising Diffusion Probabilistic Model (DDPM) baselines.
  • Figure 5: Real-world execution of semantically guided whole-body tasks by the Unitree G1 humanoid robot. Top: Complex VLA maneuvers requiring continuous spatial navigation and dynamic multi-limb coordination. Bottom: Basic VLA tasks demonstrating responsive semantic execution and robust postural stability. These results validate the system's capacity to maintain physical compliance and dynamic consistency during unconstrained deployment.
  • ...and 1 more figures
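
The Figure 2 caption describes a flow-matching model that generates a motion sequence conditioned on the latent vector $z_{vl}$ and the robot state. The following is a minimal sketch of how such a sampler might work in principle: it starts from Gaussian noise and integrates a velocity field with Euler steps until it arrives at a motion chunk. It is not the paper's implementation. The trained network is replaced by a closed-form toy velocity field, and every function name, dimension, and step count here is a hypothetical choice made for illustration only.

```python
import numpy as np


def toy_target(z, state, horizon, dof):
    """Hypothetical stand-in for the motion chunk a trained model would produce.

    Builds a smooth trajectory that starts at the current joint state and
    drifts by an offset derived from the latent vector z.
    """
    phase = np.linspace(0.0, 1.0, horizon)[:, None]        # shape (horizon, 1)
    return state[None, :] + phase * np.tanh(z[:dof])[None, :]


def velocity_field(x, t, z, state):
    """Toy conditional velocity field (assumption: replaces a learned network).

    For a straight-line probability path toward a single target, the velocity
    that transports x to the target at t = 1 is (target - x) / (1 - t).
    """
    horizon, dof = x.shape
    target = toy_target(z, state, horizon, dof)
    return (target - x) / (1.0 - t)


def sample_motion(z, state, horizon=8, n_steps=10, seed=0):
    """Euler-integrate the velocity field from noise (t = 0) toward t = 1."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((horizon, state.shape[0]))     # initial noise sample
    dt = 1.0 / n_steps
    for k in range(n_steps):
        t = k * dt                                          # t stays below 1
        x = x + dt * velocity_field(x, t, z, state)
    return x


if __name__ == "__main__":
    z_vl = np.array([0.5, -0.2, 0.1, 0.0])   # hypothetical latent intent vector
    q = np.zeros(3)                          # hypothetical 3-DoF joint state
    chunk = sample_motion(z_vl, q)
    print(chunk.shape)
```

In a real system the velocity field would be a neural network trained with a flow-matching loss, and the resulting chunk would be handed to a tracking controller rather than printed.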