GR-3 Technical Report

Chilam Cheang, Sijin Chen, Zhongren Cui, Yingdong Hu, Liqun Huang, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Xiao Ma, Hao Niu, Wenxuan Ou, Wanli Peng, Zeyu Ren, Haixin Shi, Jiawen Tian, Hongtao Wu, Xin Xiao, Yuyang Xiao, Jiafeng Xu, Yichu Yang

TL;DR

GR-3 is a 4B-parameter Vision-Language-Action model that follows complex instructions, generalizes to novel objects and environments, and handles long-horizon, dexterous tasks. It combines three data streams (robot trajectories, web-scale vision-language data, and few-shot human trajectories) in a co-training framework with flow-matching and next-token-prediction objectives, enabling rapid adaptation with minimal human data. The ByteMini robot provides a flexible platform for real-world evaluation, with whole-body control and teleoperation supporting diverse manipulation tasks. Across generalizable pick-and-place, long-horizon table bussing, and dexterous cloth manipulation, GR-3 outperforms the state-of-the-art baseline $π_0$, demonstrating strong zero-shot and few-shot generalization as well as robustness in complex tasks. The work highlights a scalable pathway toward generalist robots that can assist humans in daily life, while acknowledging limitations and avenues for future reinforcement learning integration.

Abstract

We report our recent progress towards building generalist robot policies: the development of GR-3, a large-scale vision-language-action (VLA) model. It showcases exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. Furthermore, it can be efficiently fine-tuned with minimal human trajectory data, enabling rapid and cost-effective adaptation to new settings. GR-3 also excels in handling long-horizon and dexterous tasks, including those requiring bi-manual manipulation and mobile movement, with robust and reliable performance. These capabilities are achieved through a multi-faceted training recipe that includes co-training with web-scale vision-language data, efficient fine-tuning from human trajectory data collected via VR devices, and effective imitation learning with robot trajectory data. In addition, we introduce ByteMini, a versatile bi-manual mobile robot designed with exceptional flexibility and reliability, capable of accomplishing a wide range of tasks when integrated with GR-3. Through extensive real-world experiments, we show that GR-3 surpasses the state-of-the-art baseline method, $π_0$, on a wide variety of challenging tasks. We hope GR-3 can serve as a step towards building generalist robots capable of assisting humans in daily life.

Paper Structure

This paper contains 36 sections, 2 equations, 10 figures.

Figures (10)

  • Figure 1: Overview. GR-3 is able to learn from three types of data: vision-language data, robot trajectory data, and human trajectory data. It is able to perform dexterous and long-horizon tasks with exceptional robustness and generalize well to novel objects, environments, and instructions.
  • Figure 2: Capabilities. GR-3 strictly follows instructions and is capable of understanding unseen instructions involving abstract concepts. It performs robustly and reliably on long-horizon table bussing and dexterous cloth manipulation.
  • Figure 3: The GR-3 Model. GR-3 is co-trained on both robot trajectories and vision-language data with a flow-matching objective (left) and a next-token-prediction objective (right), respectively.
  • Figure 4: The GR-3 Data. We leverage three types of data during training: robot trajectory data (top), human trajectory data (middle), and vision-language data (bottom).
  • Figure 5: The ByteMini Robot. We show the robot specifications, multi-camera views, and motion range of the unique wrist sphere joint.
  • ...and 5 more figures
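
The Figure 3 caption describes co-training with a flow-matching objective on robot trajectories and a next-token-prediction objective on vision-language data. The sketch below illustrates how two such losses might be combined in a single training step. It is a minimal illustration, not the report's implementation: the linear interpolation path, the `vl_weight` parameter, and all function names are assumptions introduced here, and the model is stood in for by a plain callable.

```python
import math
import random


def flow_matching_loss(predict_velocity, actions, rng):
    """Mean squared velocity error for one flattened action chunk.

    Assumption: linear path x_t = (1 - t) * noise + t * actions, whose
    target velocity is (actions - noise). This is one common convention
    and is not confirmed to be the one used by GR-3.
    """
    t = rng.random()  # flow time sampled uniformly from [0, 1)
    noise = [rng.gauss(0.0, 1.0) for _ in actions]
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, actions)]
    target = [a - n for n, a in zip(noise, actions)]
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(actions)


def next_token_loss(logits, target_index):
    """Cross-entropy for a single next-token prediction from raw logits."""
    m = max(logits)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def co_training_loss(robot_batch, vl_batch, predict_velocity, rng, vl_weight=1.0):
    """Mean flow-matching loss on robot data plus weighted mean next-token loss.

    robot_batch: list of flattened action chunks (lists of floats).
    vl_batch: list of (logits, target_index) pairs.
    vl_weight: hypothetical mixing weight, not taken from the report.
    """
    fm = sum(flow_matching_loss(predict_velocity, a, rng) for a in robot_batch)
    fm /= len(robot_batch)
    ntp = sum(next_token_loss(logits, y) for logits, y in vl_batch)
    ntp /= len(vl_batch)
    return fm + vl_weight * ntp
```

In a real system `predict_velocity` would be the action head conditioned on images and the language instruction, and the logits would come from the vision-language backbone; here both are placeholders so the arithmetic of the two objectives stays visible.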