BridgeVLA: Input-Output Alignment for Efficient 3D Manipulation Learning with Vision-Language Models

Peiyan Li, Yixiang Chen, Hongtao Wu, Xiao Ma, Xiangnan Wu, Yan Huang, Liang Wang, Tao Kong, Tieniu Tan

TL;DR

BridgeVLA addresses the data inefficiency of 3D vision-language-action models by projecting 3D observations into multiple 2D views and predicting 2D heatmaps for translational actions. A scalable 2D-heatmap pre-training stage teaches the VLM backbone to ground objects before fine-tuning, so the input and output stay spatially aligned throughout. The approach achieves state-of-the-art results on RLBench, COLOSSEUM, GemBench, and real-robot experiments, and is highly sample-efficient (e.g., a 96.8% success rate on 10+ tasks with only 3 trajectories per task). It also generalizes well under visual disturbances and unseen instructions.
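To make the heatmap pre-training target concrete, here is a minimal sketch of how a detection bounding box might be turned into a 2D heatmap. The Gaussian shape, the `sigma_scale` parameter, and the function name `bbox_heatmap` are illustrative assumptions, not details confirmed by this page.

```python
import math


def bbox_heatmap(box, width, height, sigma_scale=0.25):
    """Build a normalized heatmap that peaks at the center of a bounding box.

    box: (x0, y0, x1, y1) in pixel coordinates (assumed format).
    sigma_scale: hypothetical knob tying the Gaussian spread to the box size.
    Returns a height x width list of lists whose entries sum to 1.
    """
    x0, y0, x1, y1 = box
    cx, cy = (x0 + x1) / 2.0, (y0 + y1) / 2.0
    # Guard against zero-size boxes so the division below stays defined.
    sx = max((x1 - x0) * sigma_scale, 1e-6)
    sy = max((y1 - y0) * sigma_scale, 1e-6)
    heat = [
        [
            # Evaluate the Gaussian at the pixel center (u + 0.5, v + 0.5).
            math.exp(-0.5 * (((u + 0.5 - cx) / sx) ** 2 + ((v + 0.5 - cy) / sy) ** 2))
            for u in range(width)
        ]
        for v in range(height)
    ]
    total = sum(map(sum, heat))
    if total == 0.0:
        raise ValueError("box does not overlap the image")
    # Normalize so the heatmap can be treated as a distribution over pixels.
    return [[value / total for value in row] for row in heat]


# Example: an 8x8 image with a box covering its central region.
target = bbox_heatmap((2, 2, 6, 6), width=8, height=8)
```

A map normalized this way can be compared against a predicted per-pixel distribution with a cross-entropy loss, which is one common choice for heatmap supervision.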

Abstract

Recently, leveraging pre-trained vision-language models (VLMs) for building vision-language-action (VLA) models has emerged as a promising approach to effective robot manipulation learning. However, only a few methods incorporate 3D signals into VLMs for action prediction, and they do not fully leverage the spatial structure inherent in 3D data, leading to low sample efficiency. In this paper, we introduce BridgeVLA, a novel 3D VLA model that (1) projects 3D inputs to multiple 2D images, ensuring input alignment with the VLM backbone, and (2) utilizes 2D heatmaps for action prediction, unifying the input and output spaces within a consistent 2D image space. In addition, we propose a scalable pre-training method that equips the VLM backbone with the capability to predict 2D heatmaps before downstream policy learning. Extensive experiments show that the proposed method learns 3D manipulation efficiently and effectively. BridgeVLA outperforms state-of-the-art baseline methods across three simulation benchmarks. In RLBench, it improves the average success rate from 81.4% to 88.2%. In COLOSSEUM, it demonstrates significantly better performance in challenging generalization settings, boosting the average success rate from 56.7% to 64.0%. In GemBench, it surpasses all compared baseline methods in terms of average success rate. In real-robot experiments, BridgeVLA outperforms a state-of-the-art baseline method by 32% on average. It generalizes robustly in multiple out-of-distribution settings, including visual disturbances and unseen instructions. Remarkably, it achieves a success rate of 96.8% on 10+ tasks with only 3 trajectories per task, highlighting its extraordinary sample efficiency. Project website: https://bridgevla.github.io/
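The input-side idea in the abstract, projecting 3D inputs to multiple 2D images, can be sketched as below. This is a simplified illustration rather than the authors' renderer: the axis-aligned workspace `bounds`, the view names (`top`, `front`, `side`), the binary occupancy images, and the function name `project_orthographic` are all assumptions.

```python
def project_orthographic(points, bounds, res):
    """Rasterize a point cloud into three orthographic occupancy images.

    points: iterable of (x, y, z) tuples.
    bounds: ((xmin, xmax), (ymin, ymax), (zmin, zmax)), an assumed workspace box.
    res: side length of each square image in pixels.
    Returns a dict of res x res images indexed as image[row][col].
    """

    def to_pixel(value, lo, hi):
        # Map a coordinate in [lo, hi] to an integer index in [0, res - 1].
        idx = int((value - lo) / (hi - lo) * res)
        return min(max(idx, 0), res - 1)

    # Each view keeps two axes: (axis used for columns, axis used for rows).
    views = {"top": (0, 1), "front": (0, 2), "side": (1, 2)}
    images = {name: [[0] * res for _ in range(res)] for name in views}

    for p in points:
        # Skip points that fall outside the workspace box.
        if any(not (bounds[a][0] <= p[a] <= bounds[a][1]) for a in range(3)):
            continue
        for name, (u_axis, v_axis) in views.items():
            u = to_pixel(p[u_axis], *bounds[u_axis])
            v = to_pixel(p[v_axis], *bounds[v_axis])
            images[name][v][u] = 1  # mark the pixel as occupied
    return images


# Example: two points inside a unit cube, rendered at 4x4 resolution.
unit_cube = ((0.0, 1.0), (0.0, 1.0), (0.0, 1.0))
images = project_orthographic([(0.0, 0.0, 0.0), (0.5, 0.25, 0.75)], unit_cube, res=4)
```

A practical renderer would also carry color and depth per pixel and resolve occlusions. The sketch only shows the property the method relies on: every pixel in each view corresponds to a known slice of the 3D workspace.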

Paper Structure

This paper contains 42 sections, 4 equations, 14 figures, 12 tables.

Figures (14)

  • Figure 1: Overview. BridgeVLA is a novel 3D VLA model that aligns the input and output within a unified 2D image space. It is pre-trained on object grounding using 2D heatmaps and fine-tuned on action prediction for 3D manipulation. Experiment results in both simulation and the real world show that it is able to learn 3D manipulation both efficiently and effectively.
  • Figure 2: Model Architecture. (a) 2D Heatmap Pre-training: we train BridgeVLA on 2D object detection datasets. The model takes as input an image and a language description of the target object, and outputs a 2D heatmap that highlights the regions of interest corresponding to the target object. Note that the bounding box shown here is for illustrative purposes only; it is not present in the image when input to the model. (b) 3D Action Fine-tuning: the model takes as input three orthographic projection images of a 3D point cloud and a language instruction. It outputs three 2D heatmaps, which highlight the position of the end-effector in the next keyframe across all three views. For the remaining action components, it uses an MLP to process the image feature tokens to predict the rotation action, gripper action, and collision flag of the next keyframe.
  • Figure 3: Real-Robot Experiments and Results. We use a Franka Research 3 robot arm and a ZED 2i camera to capture point clouds of the scene. To evaluate the model's performance, we design 7 different settings including one basic setting and six generalization settings. Experimental results show that BridgeVLA outperforms the state-of-the-art baseline method RVT-2 [goyal2024rvt] by an average of 32%.
  • Figure 4: Prediction on Pre-training Data after Fine-tuning. To simulate the multi-view inputs during fine-tuning, we repeat each pre-training image three times and feed them into the fine-tuned model to generate heatmaps. Note that these samples are not cherry-picked. Additional samples can be found in the appendix.
  • Figure 5: Visualization of 18 RLBench [james2020rlbench] Tasks.
  • ...and 9 more figures
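
The caption of Figure 2(b) says the model outputs three 2D heatmaps that highlight the end-effector position across the three views. One plausible way to turn those heatmaps into a single 3D translation is to score a grid of candidate 3D points by the heatmap values at their projections and keep the best one. The sketch below assumes that decoding scheme, an axis-aligned workspace `bounds`, and the `top`/`front`/`side` view layout; none of these are confirmed by the captions.

```python
def decode_translation(heatmaps, bounds, res):
    """Pick the 3D grid cell whose three projections score highest.

    heatmaps: dict with 'top' (cols=x, rows=y), 'front' (cols=x, rows=z),
              and 'side' (cols=y, rows=z), each a res x res list of floats.
              This layout is an assumption made for the sketch.
    bounds: ((xmin, xmax), (ymin, ymax), (zmin, zmax)) workspace box.
    Returns the (x, y, z) center of the best-scoring grid cell.
    """

    def center(idx, lo, hi):
        # Convert a grid index back to the metric center of that cell.
        return lo + (idx + 0.5) * (hi - lo) / res

    best_score, best_idx = float("-inf"), None
    for ix in range(res):
        for iy in range(res):
            for iz in range(res):
                # Sum the evidence each view assigns to this candidate cell.
                score = (
                    heatmaps["top"][iy][ix]
                    + heatmaps["front"][iz][ix]
                    + heatmaps["side"][iz][iy]
                )
                if score > best_score:
                    best_score, best_idx = score, (ix, iy, iz)
    return tuple(center(i, *bounds[a]) for a, i in enumerate(best_idx))


# Example: three 4x4 heatmaps that agree on grid cell (ix=2, iy=1, iz=3).
res = 4
maps = {name: [[0.0] * res for _ in range(res)] for name in ("top", "front", "side")}
maps["top"][1][2] = 1.0
maps["front"][3][2] = 1.0
maps["side"][3][1] = 1.0
position = decode_translation(maps, ((0.0, 1.0), (0.0, 1.0), (0.0, 1.0)), res)
```

Because every candidate is scored by all three views jointly, a spurious peak in one view is outvoted when the other two views disagree with it.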