Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

Ganlin Yang, Tianyi Zhang, Haoran Hao, Weiyun Wang, Yibin Liu, Dehui Wang, Guanzhou Chen, Zijian Cai, Junting Chen, Weijie Su, Wengang Zhou, Yu Qiao, Jifeng Dai, Jiangmiao Pang, Gen Luo, Wenhai Wang, Yao Mu, Zhi Hou

TL;DR

Vlaser addresses the gap between upstream embodied reasoning in Vision-Language Models and downstream Vision-Language-Action policy learning by introducing a two-component embodied VLM with a dedicated data engine and a two-stage training pipeline. Trained on the Vlaser-6M dataset, it achieves state-of-the-art results across 12 embodied reasoning benchmarks and shows that in-domain simulated data accelerates VLA fine-tuning more effectively than out-of-domain data, while revealing a domain gap that limits transfer to real robots. The architecture couples an InternVL3-based VLM backbone with a flow-matching action expert for low-level control, enabling robust open-loop reasoning and closed-loop manipulation, and the study provides practical guidance on which data streams help VLA transfer. The work offers open-source resources to support reproducibility and future research in embodied AI, with implications for real-world robotic control and generalization across tasks and embodiments.

Abstract

While significant research has focused on developing embodied reasoning capabilities using Vision-Language Models (VLMs) or integrating advanced VLMs into Vision-Language-Action (VLA) models for end-to-end robot control, few studies directly address the critical gap between upstream VLM-based reasoning and downstream VLA policy learning. In this work, we take an initial step toward bridging embodied reasoning with VLA policy learning by introducing Vlaser, a Vision-Language-Action model with synergistic embodied reasoning capability. Vlaser is a foundational vision-language model designed to integrate high-level reasoning with low-level control for embodied agents. Built upon the high-quality Vlaser-6M dataset, Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks, including spatial reasoning, embodied grounding, embodied QA, and task planning. Furthermore, we systematically examine how different VLM initializations affect supervised VLA fine-tuning, offering novel insights into mitigating the domain shift between internet-scale pre-training data and embodied-specific policy learning data. Based on these insights, our approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.

Paper Structure

This paper contains 15 sections, 3 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Overall framework, capabilities, and evaluation of Vlaser. Top-left: Composition of the Vlaser-6M dataset, featuring multi-task embodied data (QA, grounding, spatial reasoning, and planning) along with in-domain simulation-sourced pairs. Top-right: A LiDAR visualization illustrating the state-of-the-art embodied reasoning capability of the Vlaser VLM. Bottom-left: The pre-trained Vlaser VLM significantly accelerates convergence in downstream Vision-Language-Action (VLA) policy learning on WidowX platform bridgedata. Bottom-right: Successful closed-loop operation of an agent powered by Vlaser within the SimplerEnv benchmark [li24simpler].
  • Figure 2: An illustration of the Vlaser architecture. Vlaser includes two components with corresponding training phases: 1) multimodal pretraining, which enhances embodied reasoning using the corresponding data engine; 2) VLA training, which is performed on the action expert module that handles low-level control through flow-matching action generation.
  • Figure 3: An illustration of the Vlaser-6M data engine for an in-domain general QA sample in SimplerEnv.
  • Figure 4: An illustration of the Vlaser-6M data engine for an in-domain embodied grounding QA sample in SimplerEnv.
  • Figure 5: An illustration of the Vlaser-6M data engine for an in-domain spatial reasoning QA sample in SimplerEnv.
  • ...and 2 more figures
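
The Figure 2 caption mentions flow-matching action generation without further detail in this summary. As a rough illustration of the general idea only, and not the authors' implementation, the toy sketch below starts from Gaussian noise and integrates a velocity field with Euler steps until it yields an action vector. Every name here (`toy_velocity`, `sample_action`, `target`, `num_steps`) is hypothetical. In a real VLA model the velocity would presumably come from a learned action expert conditioned on VLM features; here it is replaced by the closed-form velocity of a straight-line path toward a fixed target so that the example runs on its own.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical stand-in for a learned action expert.

    Returns the velocity of the straight-line path that carries the
    current point x at time t to `target` at time 1.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_action(target, num_steps=10, seed=0):
    """Generate an action by Euler-integrating the velocity field from t=0 to t=1."""
    rng = random.Random(seed)
    # Start from Gaussian noise with the same dimension as the action.
    x = [rng.gauss(0.0, 1.0) for _ in target]
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t never reaches 1, so (1 - t) stays positive
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    # Assumed 3-dimensional action, purely for illustration.
    print(sample_action([0.1, -0.2, 0.05]))
```

With this analytic velocity field the integration lands on the target up to floating-point error. With a learned field, the number of Euler steps would instead trade off inference speed against sample quality.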