Bridge Thinking and Acting: Unleashing Physical Potential of VLM with Generalizable Action Expert

Mingyu Liu, Zheng Huang, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Zongze Du, Yating Wang, Haoyi Zhu, Hao Chen, Chunhua Shen

TL;DR

This work tackles the difficulty of transferring Vision-Language Models' planning capabilities to physical robot control by decoupling high-level planning from low-level action. It introduces a generalizable action expert that refines sparse 3D waypoints provided by a VLM into dense, executable trajectories using real-time point-cloud observations, enabled by the Action Pre-training, Pointcloud Fine-tuning paradigm. The approach achieves strong zero-shot generalization, including long-horizon tasks and cross-domain scenarios, while preserving the VLM's language and reasoning capabilities with minimal fine-tuning. By bridging planning and execution with a clear geometric interface, the method significantly improves data efficiency and robustness for real-world robotic manipulation.

Abstract

Although Vision-Language Models (VLMs) have demonstrated impressive planning and reasoning capabilities, translating these abilities into the physical world introduces significant challenges. Conventional Vision-Language-Action (VLA) models, which integrate reasoning and action into a monolithic architecture, generalize poorly because they are constrained by scarce, narrow-domain data. While recent dual-system approaches attempt to decouple "thinking" from "acting", they are often constrained by semantic ambiguities within the action module. This ambiguity makes large-scale, cross-task training infeasible. Consequently, these systems typically require fine-tuning on newly collected data when deployed to novel environments, and the cooperation mechanism between the two systems remains ill-defined. To address these limitations, we introduce, for the first time, a framework centered around a generalizable action expert. Our approach uses sparse 3D trajectories as an intermediate representation, bridging the high-level planning capabilities of the VLM with the low-level physical action module. During the planning phase, the VLM is only required to generate coarse 3D waypoints. These waypoints are then processed by our generalizable action expert, which refines them into dense, executable action sequences by sampling real-time point cloud observations of the environment. To promote training efficiency and robust generalization, we introduce a novel "Action Pre-training, Pointcloud Fine-tuning" paradigm. Our method combines the broad generalization capabilities of VLMs in visual understanding and planning with the fine-grained, action-level generalization of the action expert.

Paper Structure

This paper contains 29 sections, 4 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: The pipeline of our proposed method. Our approach begins with a VLM predicting a sparse set of 3D waypoints directly in the camera frame, preserving its vision-centric knowledge. These sparse points are then transformed and interpolated via a B-spline into a continuous and smooth pose trajectory to provide dense guidance for a low-level action expert.
  • Figure 2: Overview of our data annotation pipeline. We construct our SFT dataset by first selecting keyframes based on gripper state changes.
  • Figure 3: Real World Task Setting.
  • Figure 4: Ablation on training steps.
  • Figure 5: Ablation on training strategy.
  • ...and 2 more figures
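
The Figure 1 caption describes two geometric steps: transforming the sparse camera-frame waypoints and turning them into a smooth dense trajectory with a B-spline. The following is a minimal sketch of that idea, not the authors' implementation. It makes several assumptions that the source does not state: the camera-to-robot-base transform is a known rotation matrix and translation, only positions are handled (the paper's trajectory also carries orientation), and the waypoints are used directly as control points of a clamped cubic B-spline. With that last choice the curve passes through the first and last waypoints but only approximates the interior ones. All function names (`camera_to_base`, `clamped_knots`, `de_boor`, `densify`) are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def camera_to_base(p_cam: Sequence[float],
                   rotation: Sequence[Sequence[float]],
                   translation: Sequence[float]) -> Point:
    """Rigid transform p_base = R @ p_cam + t (R and t are assumed known)."""
    return tuple(
        sum(rotation[i][j] * p_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def clamped_knots(n_ctrl: int, degree: int) -> List[float]:
    """Clamped uniform knot vector on [0, 1] for n_ctrl control points."""
    n_inner = n_ctrl - degree - 1
    inner = [(i + 1) / (n_inner + 1) for i in range(n_inner)]
    return [0.0] * (degree + 1) + inner + [1.0] * (degree + 1)


def de_boor(u: float, knots: Sequence[float],
            ctrl: Sequence[Sequence[float]], degree: int) -> Point:
    """Evaluate the B-spline at parameter u in [0, 1] with de Boor's algorithm."""
    n = len(ctrl)
    # Locate the knot span k with knots[k] <= u < knots[k + 1] (last span for u = 1).
    k = degree
    while k < n - 1 and u >= knots[k + 1]:
        k += 1
    d = [list(ctrl[j + k - degree]) for j in range(degree + 1)]
    for r in range(1, degree + 1):
        for j in range(degree, r - 1, -1):
            i = j + k - degree
            denom = knots[i + degree - r + 1] - knots[i]
            alpha = 0.0 if denom == 0.0 else (u - knots[i]) / denom
            d[j] = [(1.0 - alpha) * a + alpha * b for a, b in zip(d[j - 1], d[j])]
    return tuple(d[degree])


def densify(waypoints: Sequence[Sequence[float]],
            n_samples: int = 50, degree: int = 3) -> List[Point]:
    """Sample a dense path from sparse waypoints used as B-spline control points."""
    if len(waypoints) < 2 or n_samples < 2:
        raise ValueError("need at least 2 waypoints and 2 samples")
    p = min(degree, len(waypoints) - 1)  # lower the degree for very short inputs
    knots = clamped_knots(len(waypoints), p)
    return [de_boor(s / (n_samples - 1), knots, waypoints, p)
            for s in range(n_samples)]
```

A typical call would map each predicted waypoint through `camera_to_base` and then pass the resulting list to `densify`. If the dense path must pass exactly through every waypoint, an interpolating fit that solves for the control points would be needed instead.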