GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

GigaBrain Team; Boyuan Wang; Chaojun Ni; Guan Huang; Guosheng Zhao; Hao Li; Jie Li; Jindi Lv; Jingyu Liu; Lv Feng; Mingming Yu; Peng Li; Qiuping Deng; Tianze Liu; Xinyu Zhou; Xinze Chen; Xiaofeng Wang; Yang Wang; Yifan Li; Yifei Nie; Yilong Li; Yukun Zhou; Yun Ye; Zhichao Liu; Zheng Zhu

TL;DR

GigaBrain-0.5M* tackles the limited foresight of Vision-Language-Action models by integrating world-model-based reinforcement learning through the RAMP framework. Building on GigaBrain-0.5, it conditions the policy on future-state and value predictions from a pre-trained world model, enabling self-improvement via human-in-the-loop rollouts and continual training. Empirical results show strong long-horizon manipulation capabilities and cross-task generalization, with substantial gains over baselines and reliable real-world deployment. The approach achieves state-of-the-art performance on internal tasks and RoboChallenge benchmarks, demonstrating the practical impact of world-model conditioning for embodied AI.

Abstract

Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. We therefore propose \textit{GigaBrain-0.5M*}, a VLA model trained via world model-based reinforcement learning. It builds upon \textit{GigaBrain-0.5}, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. \textit{GigaBrain-0.5M*} further integrates world model-based reinforcement learning via \textit{RAMP} (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that \textit{RAMP} achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30\% on challenging tasks including \texttt{Laundry Folding}, \texttt{Box Packing}, and \texttt{Espresso Preparation}. Critically, \textit{GigaBrain-0.5M*} exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure, as validated by real-world deployment videos on our \href{https://gigabrain05m.github.io}{project page}.

Paper Structure (14 sections, 8 equations, 14 figures, 1 table)

Introduction
Related Works
    Vision-Language-Action Models
    World Models for Policy Models
    Reinforcement Learning for Vision-Language-Action Models
GigaBrain-0.5M*
    GigaBrain-0.5
    RAMP
        RAMP Formulation
        The RAMP Implementation
Experiment
    Foundation Model Performance
    RAMP Performance
Conclusion and Future Work

Figures (14)

Figure 1: Overview of RAMP. The RAMP framework operates through a four-stage pipeline. (1) World Model Pre-training establishes a unified representation space for both future state prediction and value estimation. (2) Policy Training with World Model Condition initializes the GigaBrain-0.5 policy with explicit world model conditioning. (3) Human-in-the-Loop Rollout (HILR) Data Collection generates diverse and high-quality trajectories through autonomous execution followed by expert corrections. (4) Continual Training with Rollout Data updates the policy using the annotated trajectory data, incorporating both successful demonstrations and corrective signals. This tightly integrated closed-loop process facilitates continuous policy refinement and self-improvement.
Figure 2: Data distribution of the pre-training stage of GigaBrain-0.5.
Figure 3: Performance of GigaBrain-0.5 on internal evaluation.
Figure 4: Deployment of GigaBrain-0.5 on PiPER arms for real-world Box Packing.
Figure 5: Deployment of GigaBrain-0.5 on the G1 humanoid robot for real-world Box Moving.
...and 9 more figures
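
The four-stage pipeline described in the Figure 1 caption can be pictured as a simple control-flow skeleton. The sketch below is illustrative only and is not the authors' implementation: the class `RampLoop`, the `Trajectory` record, the `needs_correction` callback, and the choice to repeat stages 3 and 4 over several rounds are all assumptions introduced here to show how the stages might be ordered. No models are trained; each stage only records that it ran.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """Hypothetical record of one rollout (not a structure from the paper)."""
    steps: List[str]          # placeholder for observation/action pairs
    corrected: bool = False   # True if a human expert intervened


@dataclass
class RampLoop:
    """Hypothetical skeleton of the four stages named in the Figure 1 caption."""
    log: List[str] = field(default_factory=list)
    dataset: List[Trajectory] = field(default_factory=list)

    def pretrain_world_model(self) -> None:
        # Stage 1: world model pre-training (placeholder, nothing is trained).
        self.log.append("world_model_pretraining")

    def train_policy_with_world_model_condition(self) -> None:
        # Stage 2: policy training conditioned on world model outputs.
        self.log.append("policy_training_with_world_model_condition")

    def collect_hilr_rollouts(
        self, n: int, needs_correction: Callable[[int], bool]
    ) -> List[Trajectory]:
        # Stage 3: autonomous rollouts, some followed by expert corrections.
        # `needs_correction` stands in for the human-in-the-loop decision.
        self.log.append("hilr_data_collection")
        return [
            Trajectory(steps=[f"rollout-{i}"], corrected=needs_correction(i))
            for i in range(n)
        ]

    def continual_training(self, rollouts: List[Trajectory]) -> None:
        # Stage 4: add the annotated rollouts to the training data.
        self.log.append("continual_training")
        self.dataset.extend(rollouts)

    def run(
        self,
        rounds: int,
        rollouts_per_round: int,
        needs_correction: Callable[[int], bool],
    ) -> None:
        self.pretrain_world_model()
        self.train_policy_with_world_model_condition()
        # Assumption: stages 3 and 4 repeat to form the closed loop.
        for _ in range(rounds):
            rollouts = self.collect_hilr_rollouts(rollouts_per_round, needs_correction)
            self.continual_training(rollouts)
```

As a usage example, `RampLoop().run(rounds=2, rollouts_per_round=3, needs_correction=lambda i: i % 2 == 1)` would run stages 1 and 2 once, then alternate stages 3 and 4 twice, accumulating six placeholder trajectories.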
