ReconVLA: Reconstructive Vision-Language-Action Model as Effective Robot Perceiver

Wenxuan Song, Ziyang Zhou, Han Zhao, Jiayi Chen, Pengxiang Ding, Haodong Yan, Yuxin Huang, Feilong Tang, Donglin Wang, Haoang Li

TL;DR

ReconVLA addresses dispersed visual attention in Vision-Language-Action (VLA) models with an implicit grounding mechanism: a diffusion-based denoiser, conditioned on the VLA's visual outputs, reconstructs the gaze region of the image. The approach uses a latent visual reconstruction framework with a VAE-based visual tokenizer, plus a large-scale pretraining dataset (over 100k trajectories and 2M samples) that improves generalization in reconstruction and manipulation. In simulation and real-world experiments, ReconVLA manipulates objects more precisely and generalizes robustly to unseen objects, outperforming explicit grounding, chain-of-thought grounding, and other generative baselines. The results indicate that focusing perception on target regions improves long-horizon robotic manipulation across diverse environments.

Abstract

Recent advances in Vision-Language-Action (VLA) models have enabled robotic agents to integrate multimodal understanding with action execution. However, our empirical analysis reveals that current VLAs struggle to allocate visual attention to target regions. Instead, visual attention is always dispersed. To ground visual attention on the correct target, we propose ReconVLA, a reconstructive VLA model with an implicit grounding paradigm. Conditioned on the model's visual outputs, a diffusion transformer aims to reconstruct the gaze region of the image, which corresponds to the target manipulated objects. This process prompts the VLA model to learn fine-grained representations and accurately allocate visual attention, so that it effectively leverages task-specific visual information and manipulates precisely. Moreover, we curate a large-scale pretraining dataset comprising over 100k trajectories and 2 million data samples from open-source robotic datasets, further boosting the model's generalization in visual reconstruction. Extensive experiments in simulation and the real world demonstrate the superiority of our implicit grounding method, showcasing its capability for precise manipulation and generalization. Our project page is https://zionchow.github.io/ReconVLA/.
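
The reconstruction step described above can be written as a training objective. The formula below is a sketch that assumes a standard noise-prediction diffusion loss; the paper's exact formulation may differ. Here $z_0$ denotes the tokenized gaze region and $z_t$ its noised version at step $t$. The symbols $h$ (the VLA's visual outputs used as the condition), $\epsilon_\theta$ (the denoiser), and $\bar{\alpha}_t$ (the cumulative noise schedule) are notation introduced here for illustration.

$$
z_t = \sqrt{\bar{\alpha}_t}\, z_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)
$$

$$
\mathcal{L}_{\text{recon}} = \mathbb{E}_{z_0,\, \epsilon,\, t} \left[ \left\lVert \epsilon - \epsilon_\theta(z_t, t, h) \right\rVert_2^2 \right]
$$

Under this assumed form, the loss can only be reduced if $h$ carries enough information about the gaze region, which is the intuition behind using reconstruction as implicit grounding supervision.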

Paper Structure

This paper contains 35 sections, 3 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Visualization of the observation, gaze region, and attention map. The long-horizon task "stack blocks" requires the arm to lift the blue block and put it on the pink one. Although there are several distractors, our model adaptively adjusts the gaze region, guiding the allocation of visual attention to the right target. With precise visual grounding, it sequentially manipulates different target objects and successfully completes the task.
  • Figure 2: Conceptual comparison between different paradigms. (a) Explicit Grounding: an external grounding expert is employed, and both entire and cropped images are input (RoboGround, VIRT). (b) CoT Grounding: bounding-box coordinates are output before the action in a chain-of-thought (CoT) manner (ECoT, GraspVLA). (c) Implicit Grounding: our ReconVLA directly uses crucial regions as implicit visual supervision for its visual outputs, called reconstructive tokens, through a reconstruction process.
  • Figure 3: Architecture of our ReconVLA. Our model consists of a reconstructive part and an action part. The input includes multi-view images and a text instruction. For the action part, the model outputs discrete action tokens. For the reconstructive part, ReconVLA is guided to output reconstructive tokens, which condition the denoising process that reconstructs the scene tokens $z_0$ from noisy $z_t$. The scene tokens are tokenized images of gaze regions. This supervision enhances ReconVLA's visual grounding and fine-grained comprehension, which contribute to precise manipulation.
  • Figure 4: Qualitative comparison of attention maps on CALVIN and in the real world. Row 1: The baseline exhibits dispersed attention patterns or predominantly attends to an incorrect region, leading to inaccurate actions. Row 2: With auxiliary visual supervision signals, ReconVLA focuses on specific image contents with higher attention values and moves precisely to the target region, thus successfully completing the task.
  • Figure 5: Real-world setup for four representative tasks. We use a 6-DoF AgileX PiPer robotic arm with a 1-DoF parallel gripper, a RealSense D515 depth camera as the Eye-on-Base view, and an ORBBEC Dabai depth camera as the Eye-on-Hand view. We select four representative and practically meaningful tasks: (1) Stack bowls, (2) Put fruit into bowl, (3) Flip cups, (4) Bus table.
  • ...and 1 more figure
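
The reconstructive branch described in the Figure 3 caption (reconstructive tokens conditioning a denoiser that recovers $z_0$ from noisy $z_t$) can be sketched as a conditional denoising loss. The snippet below is a minimal pure-Python sketch, not the authors' implementation. It assumes a standard noise-prediction (DDPM-style) objective and a linear beta schedule, neither of which the source confirms. The names `alpha_bar_schedule`, `add_noise`, `toy_denoiser`, and `reconstruction_loss` are hypothetical, and the scalar-weight `toy_denoiser` is only a stand-in for the diffusion transformer.

```python
import math
import random


def alpha_bar_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for an assumed linear beta schedule."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(z0, alpha_bar_t, rng):
    """Forward diffusion: z_t = sqrt(a) * z0 + sqrt(1 - a) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    signal = math.sqrt(alpha_bar_t)
    noise = math.sqrt(1.0 - alpha_bar_t)
    z_t = [signal * z + noise * e for z, e in zip(z0, eps)]
    return z_t, eps


def toy_denoiser(z_t, recon_tokens, w_z, w_c):
    """Hypothetical stand-in for the diffusion transformer.

    Predicts the noise from z_t and the mean-pooled conditioning tokens.
    The timestep embedding is omitted for brevity.
    """
    cond = sum(recon_tokens) / len(recon_tokens)
    return [w_z * z + w_c * cond for z in z_t]


def reconstruction_loss(z0, recon_tokens, alpha_bar_t, w_z, w_c, rng):
    """Mean squared error between predicted and true noise."""
    z_t, eps = add_noise(z0, alpha_bar_t, rng)
    eps_hat = toy_denoiser(z_t, recon_tokens, w_z, w_c)
    return sum((p - e) ** 2 for p, e in zip(eps_hat, eps)) / len(eps)


# Usage with made-up values: 16 scene-token scalars, 4 conditioning scalars.
rng = random.Random(0)
alpha_bar = alpha_bar_schedule(1000)
z0 = [rng.gauss(0.0, 1.0) for _ in range(16)]            # tokenized gaze region
recon_tokens = [rng.gauss(0.0, 1.0) for _ in range(4)]   # VLA visual outputs
t = rng.randrange(1000)
loss = reconstruction_loss(z0, recon_tokens, alpha_bar[t], 0.5, 0.1, rng)
print(f"toy reconstruction loss at t={t}: {loss:.4f}")
```

In a real model, the gradient of this loss would flow back through the conditioning tokens into the VLA backbone, which is how reconstruction can act as implicit supervision on visual attention. The scalar weights here are fixed only to keep the sketch short.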