SG-VLA: Learning Spatially-Grounded Vision-Language-Action Models for Mobile Manipulation

Ruisen Tu, Arth Shukla, Sohyun Yoo, Xuanlin Li, Junxi Li, Jianwen Xie, Hao Su, Zhuowen Tu

Abstract

Vision-Language-Action (VLA) models show promise for robotic control, yet performance in complex household environments remains suboptimal. Mobile manipulation requires reasoning about global scene layout, fine-grained geometry, and high-dimensional continuous actions, making standard imitation learning insufficient. We introduce a framework for learning spatially-grounded VLA models that strengthens perception and representation through auxiliary task co-training and multi-modal input enhancement. Our method addresses the challenge of controlling a 13-dimensional action space involving coordinated base motion, arm articulation, and gripper actuation. To enrich spatial understanding, the model incorporates multi-view RGB observations, depth cues, and short temporal history, providing perspectives of both global scene structure and local manipulation context. To improve representation quality, we co-train auxiliary decoders that reconstruct interpretable intermediate signals (global robot position, joint configurations, grasp affordances, target-object relative pose, and segmentation masks) from shared visual-language features. These objectives provide dense supervision that encourages the backbone to develop spatially grounded, manipulation-aware latent representations. Through extensive evaluation on home rearrangement tasks, our approach achieves consistent improvements across picking, placing, opening, and closing operations, substantially outperforming direct imitation learning. Our findings suggest that spatial grounding through auxiliary and multi-modal learning provides a strong direction for scaling VLA models toward general-purpose domestic robots.

Paper Structure (20 sections, 5 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Multi-view segmentation data used for auxiliary task training. Each group shows (left) the original RGB observation, (middle) segmentation masks for all objects in the scene with different colors representing different object instances, and (right) the processed binary mask highlighting only the target object of interest. (a) Head camera view provides a global perspective of the manipulation scene. (b) Hand camera view offers a close-up perspective focused on the manipulation area. The target object masks are derived from the full segmentation by identifying the target object ID and creating binary masks to focus the model's attention on the relevant manipulation target.
  • Figure 2: SG-VLA Architecture Overview. The model processes multi-modal inputs, including RGB images and normalized depth maps from the head and hand cameras, alongside a natural language instruction. During training, the latent representation from the LLM is passed to the decoders for auxiliary task predictions. The model then predicts a 13-dimensional action vector, generated either directly by the LLM backbone or through a specialized flow matching action expert, depending on the task type. The action vector consists of: $\Delta X$, a 3D vector representing the base's pose (position and orientation); $\Delta z$, a scalar representing the change in the torso's height; $\Delta q$, a 7D vector representing the change in the arm's joint angles; and $\Delta G$, a 2D vector representing the change in the gripper's state (one value per finger).
  • Figure 3: Multi-stage Training Scheme. Stage 1: Decoder adaptation phase where auxiliary decoders are trained while gradient flow to the VLM backbone is blocked, allowing decoders to learn from fixed VLM representations. Stage 2: Joint refinement phase with full gradient flow enabled, co-training all auxiliary decoders with the VLM backbone. Stage 3: Action head training phase where the VLM backbone is frozen to train the flow matching action head in isolation.
  • Figure 4: Sample execution trajectories for six household manipulation tasks in the ManiSkill-HAB evaluation. Each row shows a temporal sequence for one task performed by SG-VLA. The large frames show the global environment, while the vertical insets display, from top to bottom, the head depth, head RGB, hand depth, and hand RGB observations.
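
The Figure 1 caption describes deriving a binary target mask from the full instance segmentation by selecting the target object ID. A minimal sketch of that step is shown here, assuming the segmentation is a 2D grid of integer instance IDs; the function name `target_mask` and the toy IDs are hypothetical and not taken from the paper.

```python
def target_mask(instance_ids, target_id):
    """Return a binary mask (1 = target pixel, 0 = everything else).

    instance_ids: 2D list of integer instance IDs, one per pixel (assumed layout).
    target_id:    ID of the object the task refers to.
    """
    return [[1 if pixel == target_id else 0 for pixel in row]
            for row in instance_ids]


# Toy 3x4 segmentation map: 0 is background, 5 and 7 are made-up object IDs.
seg = [
    [0, 5, 5, 0],
    [0, 5, 7, 7],
    [0, 0, 7, 7],
]
mask = target_mask(seg, target_id=7)
for row in mask:
    print(row)
```

The same comparison would be applied independently to the head-camera and hand-camera segmentation maps, giving one binary mask per view.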
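
The 13-dimensional action vector in the Figure 2 caption decomposes as 3 (base) + 1 (torso) + 7 (arm) + 2 (gripper). The sketch below splits a flat vector into those four parts. The ordering of the components inside the vector is an assumption (it simply follows the order in which the caption lists them), and the names `MobileManipAction` and `split_action` are illustrative.

```python
from typing import NamedTuple, Sequence

ACTION_DIM = 13  # 3 (base) + 1 (torso) + 7 (arm) + 2 (gripper)


class MobileManipAction(NamedTuple):
    base: tuple      # Delta X: 3 values for the base pose
    torso: float     # Delta z: change in torso height
    arm: tuple       # Delta q: 7 arm joint-angle changes
    gripper: tuple   # Delta G: 2 values, one per finger


def split_action(action: Sequence[float]) -> MobileManipAction:
    """Split a flat 13-D action into named parts (component order is assumed)."""
    if len(action) != ACTION_DIM:
        raise ValueError(f"expected {ACTION_DIM} values, got {len(action)}")
    return MobileManipAction(
        base=tuple(action[0:3]),
        torso=action[3],
        arm=tuple(action[4:11]),
        gripper=tuple(action[11:13]),
    )


parts = split_action([0.0] * ACTION_DIM)
print(len(parts.base), len(parts.arm), len(parts.gripper))
```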
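
The three-stage scheme in the Figure 3 caption can be summarized as a schedule of which components are updated in each stage. The following is a rough sketch rather than the authors' training code: the component labels are hypothetical, and two details are assumptions the caption does not settle, namely that the backbone receives no updates at all in Stage 1 and that the flow matching head is trained only in Stage 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset            # components whose weights are updated
    aux_grads_reach_backbone: bool  # do auxiliary losses backpropagate into the VLM?


# Component names are hypothetical labels, not identifiers from the paper.
SCHEDULE = (
    # Stage 1: decoders learn from fixed VLM features; gradients to the VLM are blocked.
    Stage("decoder_adaptation", frozenset({"aux_decoders"}), False),
    # Stage 2: full gradient flow; decoders and backbone are co-trained.
    Stage("joint_refinement", frozenset({"aux_decoders", "vlm_backbone"}), True),
    # Stage 3: backbone frozen; the flow matching action head is trained alone.
    Stage("action_head_training", frozenset({"flow_action_head"}), False),
)


def is_frozen(stage: Stage, component: str) -> bool:
    """A component is frozen in a stage if it is not listed as trainable."""
    return component not in stage.trainable


for stage in SCHEDULE:
    print(stage.name, sorted(stage.trainable))
```

In a deep learning framework, a frozen component would typically have gradient tracking disabled on its parameters, and Stage 1 would additionally detach the backbone features before they reach the decoders.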