ST4VLA: Spatially Guided Training for Vision-Language-Action Models

Jinhui Ye; Fangjing Wang; Ning Gao; Junqiu Yu; Yangkun Zhu; Bin Wang; Jinyu Zhang; Weiyang Jin; Yanwei Fu; Feng Zheng; Yilun Chen; Jiangmiao Pang

TL;DR

ST4VLA addresses the gap between multimodal understanding and embodied robot control by injecting transferable spatial priors into a two-stage Vision-Language-Action framework. A slow VLM planner learns spatial grounding in Stage 1, while a fast action expert executes grounded policies in Stage 2, with a lightweight querying transformer and gradient-decay to keep perception and action objectives aligned. Empirically, ST4VLA achieves state-of-the-art results across public benchmarks, large-scale simulated pick-and-place, and real-world long-horizon tasks, with strong generalization to unseen objects and instructions. The approach demonstrates that scalable spatial grounding and spatial prompting can robustly bridge perception, planning, and action in embodied AI, offering practical benefits for real-world robotics. Overall, spatially guided training emerges as a principled path to robust, generalist robot learning.

Abstract

Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system Vision-Language-Action framework that leverages Spatially Guided Training to align action learning with spatial priors in VLMs. ST4VLA includes two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction from both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors to guide action generation via spatial prompting. This design preserves spatial grounding during policy learning and promotes consistent optimization across spatial and action objectives. Empirically, ST4VLA achieves substantial improvements over vanilla VLA, with performance increasing from 66.1 to 84.6 on Google Robot and from 54.7 to 73.2 on WidowX Robot, establishing new state-of-the-art results on SimplerEnv. It also demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings. These results highlight scalable spatially guided training as a promising direction for robust, generalizable robot learning. Source code, data, and models are released at https://internrobotics.github.io/internvla-m1.github.io/
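
To put the reported SimplerEnv numbers in perspective, the short sketch below computes the absolute and relative gains over the vanilla VLA baseline. The four success rates are taken from the abstract; the helper name `gains` and the rounding to one decimal place are illustrative choices, not part of the paper.

```python
# Success rates (%) quoted in the abstract for SimplerEnv:
# (vanilla VLA baseline, ST4VLA).
reported = {
    "Google Robot": (66.1, 84.6),
    "WidowX Robot": (54.7, 73.2),
}


def gains(before, after):
    """Return (absolute gain in points, relative gain in percent).

    Hypothetical helper for illustration; both values are rounded
    to one decimal place to match the precision of the abstract.
    """
    absolute = round(after - before, 1)
    relative = round(100.0 * (after - before) / before, 1)
    return absolute, relative


summary = {name: gains(b, a) for name, (b, a) in reported.items()}

for name, (abs_gain, rel_gain) in summary.items():
    print(f"{name}: +{abs_gain} points ({rel_gain}% relative)")
```

Both settings improve by the same 18.5 points in absolute terms. The relative gain is larger on WidowX Robot because its baseline is lower.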

Paper Structure (45 sections, 3 equations, 25 figures, 11 tables)

Introduction
Methods
Model Architecture
Training Recipe
Experiments
Preliminary: Perception-action Co-optimization
Experiments on Public Benchmark
Evaluation in Simulated Large-scale Pick-and-place
Evaluation in Real-world Cluttered-scene Pick-and-place
Evaluation in Long-horizon Manipulation
Related Work
Discussion and Conclusion
Projection-space Similarity (PSS)
Setup.
Projection-space similarity (PSS).
...and 30 more sections

Figures (25)

Figure 1: ST4VLA integrates spatial priors into the vision–language–action training pipeline. Given a task instruction, the VLM planner produces latent plans through explicit spatial prompting, which then effectively guides the action expert to generate control signals.
Figure 2: Overview of ST4VLA. ST4VLA adopts a spatially guided two-stage training pipeline. Stage 1 (spatial grounding pre-training): the VLM is trained on large-scale multisource multimodal spatial grounding data to learn embodiment-agnostic spatial priors. Stage 2 (spatially guided action post-training): the VLM Planner, functioning as a slow but reliable System 2 reasoner, generates latent planning tokens via spatial prompting as the condition to the action expert (instantiated as a DiT Actor) to execute as a fast System 1 embodiment-specific controller.
Figure 3: Ablation study on the effect of auxiliary spatial prompting during co-training. From left to right: (a) perception performance (IoU@0.5 on RefCOCO-g); (b) manipulation performance (Average Success Rate on WidowX); (c) gradient similarity between the spatial grounding and action policy objectives under vanilla co-training versus the proposed spatial-prompting co-training.
Figure 4: Success rate (%) across different generalization settings on 200 simulated instruction-following pick-and-place tasks.
Figure 5: Demonstration and results of long-horizon instruction-following manipulation tasks.
...and 20 more figures
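
The quantities named in the Figure 3 caption can be illustrated with a small sketch. It assumes that IoU@0.5 means the fraction of predicted boxes whose intersection-over-union with the reference box is at least 0.5, and that gradient similarity can be approximated by the cosine similarity between two flattened gradient vectors. The paper's own similarity measure (the outline lists a Projection-space Similarity section) may be defined differently. All function names and the toy inputs are hypothetical.

```python
import math


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target meets the threshold."""
    hits = sum(box_iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(preds)


def cosine_similarity(g1, g2):
    """Cosine similarity between two flattened gradient vectors.

    Positive values mean the two objectives push shared parameters in
    similar directions; negative values indicate conflicting updates.
    """
    dot = sum(x * y for x, y in zip(g1, g2))
    n1 = math.sqrt(sum(x * x for x in g1))
    n2 = math.sqrt(sum(y * y for y in g2))
    if n1 == 0.0 or n2 == 0.0:
        return 0.0
    return dot / (n1 * n2)


# Toy inputs (not from the paper): one exact match, one partial overlap.
preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
targets = [(0, 0, 2, 2), (1, 1, 3, 3)]
acc = accuracy_at_iou(preds, targets)

# Toy gradients for a grounding loss and an action loss on shared weights.
grad_grounding = [1.0, 2.0, 0.0]
grad_action = [2.0, 4.0, 0.0]
sim = cosine_similarity(grad_grounding, grad_action)
```

In practice the gradient vectors would come from backpropagating each loss separately through the shared VLM parameters; here they are plain lists so the sketch runs without a deep learning framework.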
