HoloBrain-0 Technical Report

Xuewu Lin, Tianwei Lin, Yun Du, Hongyu Xie, Yiwei Jin, Jiawei Li, Shijie Wu, Qingze Wang, Mengdi Li, Mengao Zhao, Ziang Li, Chaodong Huang, Hongzhe Bi, Lichao Huang, Zhizhong Su

TL;DR

HoloBrain-0 introduces a cross-embodiment Vision-Language-Action (VLA) framework that explicitly grounds policies in robot embodiment priors (e.g., URDF kinematic descriptions and multi-view camera parameters) to enable robust 3D spatial reasoning and generalization. Its architecture combines a VLM backbone with a Perspective-aware Spatial Enhancer and an Embodiment-aware Action Expert, augmented by SimpleRTC and Teacher Forcing to achieve smooth, low-latency deployment. The work couples a two-stage data strategy (large-scale cross-embodiment pre-training followed by test-driven post-training) with RoboOrchard, a full-stack open-source infrastructure for data curation, training, and deployment. It reports state-of-the-art results on RoboTwin 2.0, LIBERO, and GenieSim, strong results on challenging real-world tasks, and a compact 0.2B-parameter variant that rivals significantly larger baselines. The combination of embodiment priors, a lightweight yet capable action head, and an end-to-end data-to-deployment pipeline offers a practical, reproducible path toward generalist robotic manipulation.

Abstract

In this work, we introduce HoloBrain-0, a comprehensive Vision-Language-Action (VLA) framework that bridges the gap between foundation model research and reliable real-world robot deployment. The core of our system is a novel VLA architecture that explicitly incorporates robot embodiment priors, including multi-view camera parameters and kinematic descriptions (URDF), to enhance 3D spatial reasoning and support diverse embodiments. We validate this design through a scalable "pre-train then post-train" paradigm, achieving state-of-the-art results on simulation benchmarks such as RoboTwin 2.0, LIBERO, and GenieSim, as well as strong results on challenging long-horizon real-world manipulation tasks. Notably, our efficient 0.2B-parameter variant rivals significantly larger baselines, enabling low-latency on-device deployment. To further accelerate research and practical adoption, we fully open-source the entire HoloBrain ecosystem, which includes: (1) powerful pre-trained VLA foundations; (2) post-trained checkpoints for multiple simulation suites and real-world tasks; and (3) RoboOrchard, a full-stack VLA infrastructure for data curation, model training and deployment. Together with standardized data collection protocols, this release provides the community with a complete, reproducible path toward high-performance robotic manipulation.

Paper Structure (30 sections, 9 equations, 13 figures, 14 tables)

Figures (13)

  • Figure 1: Overview of HoloBrain-0. By incorporating explicit embodiment modeling (e.g., camera parameters and kinematic descriptions), our model effectively unifies training across heterogeneous robots. Together with a full-stack VLA infrastructure (RoboOrchard) and an effective test-driven data strategy, HoloBrain-0 delivers superior performance on both real-world and simulation manipulation benchmarks.
  • Figure 2: Visualization of input state representation and output action space of our action expert.
  • Figure 3: Visualizing 3D consistency verification. We project the 6D pose of each joint onto the image coordinates (including third-person and wrist cameras) based on the camera intrinsic and extrinsic parameters. Any episodes with inaccurate projection results are identified as erroneous and subsequently filtered out.
  • Figure 4: Overview of the RoboOrchard infrastructure. The system comprises three decoupled layers: a bottom Hardware Abstraction Layer that bridges the gap between simulation and real-world hardware via unified interfaces; a central Middleware Layer that drives the data-to-policy pipeline including storage, training, and deployment; and a top Interaction Layer facilitating user management and visualization.
  • Figure 5: Real-world evaluation task suite for HoloBrain-0. The suite comprises 7 basic tasks (shaded in gray), 2 long-horizon dexterous manipulation tasks, and 1 general object pick-and-place task.
  • ...and 8 more figures
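
The 3D consistency verification described in the Figure 3 caption rests on standard pinhole projection: a joint position is moved into the camera frame with the extrinsics and then mapped to pixels with the intrinsics. The following is a minimal sketch of that idea, not the authors' implementation. The function names, the zero-skew intrinsic matrix, the comparison against reference pixel locations, and the 5-pixel threshold are all assumptions made for illustration; the caption does not state how inaccurate projections are detected. Only the position part of each 6D pose is projected here.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]


def project_point(p_world: Vec3, K: Mat3, R: Mat3, t: Vec3) -> Tuple[float, float]:
    """Project a world-frame 3D point to pixel coordinates (pinhole model).

    R and t are the extrinsics mapping world to camera: p_cam = R @ p_world + t.
    K is the intrinsic matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]] (zero skew assumed).
    """
    p_cam = [sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)]
    if p_cam[2] <= 0:
        raise ValueError("point is behind the camera")
    u = K[0][0] * p_cam[0] / p_cam[2] + K[0][2]
    v = K[1][1] * p_cam[1] / p_cam[2] + K[1][2]
    return u, v


def episode_is_consistent(
    joint_positions: Sequence[Vec3],
    reference_pixels: Sequence[Tuple[float, float]],
    K: Mat3,
    R: Mat3,
    t: Vec3,
    max_error_px: float = 5.0,  # hypothetical threshold, not from the paper
) -> bool:
    """Return True if every projected joint lies within max_error_px of its reference pixel."""
    for p, (u_ref, v_ref) in zip(joint_positions, reference_pixels):
        u, v = project_point(p, K, R, t)
        if math.hypot(u - u_ref, v - v_ref) > max_error_px:
            return False
    return True


if __name__ == "__main__":
    K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
    R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # camera aligned with world
    t = (0.0, 0.0, 0.0)
    print(project_point((0.1, -0.05, 1.0), K, R, t))
```

In a pipeline like the one the caption describes, a check of this kind would be run per camera (third-person and wrist), and an episode failing it would be dropped from the training set.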