SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead

Chaojun Ni, Cheng Chen, Xiaofeng Wang, Zheng Zhu, Wenzhao Zheng, Boyuan Wang, Tianrun Chen, Guosheng Zhao, Haoyun Li, Zhehao Dong, Qiang Zhang, Yun Ye, Yang Wang, Guan Huang, Wenjun Mei

TL;DR

SwiftVLA tackles the practicality gap of Vision-Language-Action models by enabling 4D spatiotemporal reasoning in a compact VLM. It uses a pretrained 4D visual geometry transformer with a temporal cache to produce 4D features, fuses them with 2D cues via learnable Fusion Tokens, and trains with a mask-and-reconstruct objective to distill 4D knowledge into a lightweight VLA. The Fusion Tokens are supervised by the end-effector's future trajectory, which aligns the multimodal representations for action generation, and the 4D branch is dropped at inference to keep overhead low. On the RoboTwin 2.0 and LIBERO benchmarks, SwiftVLA matches or surpasses larger models, and on edge devices it achieves up to 18x faster inference with a 12x smaller memory footprint.

Abstract

Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. Then, to enhance the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that masks 4D inputs to the VLM and trains the VLA to reconstruct them, enabling the VLM to learn effective 4D representations and allowing the 4D branch to be dropped at inference with minimal performance loss. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger, achieving comparable performance on edge devices while being 18 times faster and reducing memory footprint by 12 times.
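The mask-and-reconstruct idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the token labels, the function names (`attention_mask_without`, `training_loss`), the mean-squared-error reconstruction term, and the `recon_weight` parameter are all assumptions made for illustration. The sketch hides 4D tokens as attention keys and adds a reconstruction penalty to the action loss.

```python
from typing import List, Sequence


def attention_mask_without(token_types: Sequence[str], hidden: str) -> List[List[bool]]:
    """Build a square mask where mask[q][k] is True if query q may attend to key k.

    Every token whose type equals `hidden` is excluded as a key, so no query
    can read it. (Simplification: rows for hidden tokens are still emitted.)
    """
    n = len(token_types)
    return [[token_types[k] != hidden for k in range(n)] for _ in range(n)]


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error, used here as an assumed reconstruction loss."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def training_loss(action_loss: float,
                  recon_pred: Sequence[float],
                  recon_target: Sequence[float],
                  recon_weight: float = 1.0) -> float:
    """Combine the action loss with the weighted reconstruction term."""
    return action_loss + recon_weight * mse(recon_pred, recon_target)


# Hypothetical token layout: two 2D tokens, two 4D tokens, one Fusion Token.
tokens = ["2d", "2d", "4d", "4d", "fusion"]
mask = attention_mask_without(tokens, hidden="4d")
```

Because the model learns to produce the hidden 4D features from the remaining tokens during training, the same mask can be applied permanently at inference, which is what allows the 4D branch to be removed.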

Paper Structure

This paper contains 20 sections, 9 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Large VLMs such as PaliGemma-3B [paligemma] outperform small VLMs [smolvlm] in spatial reasoning, with correct answers shown in green and incorrect ones in red. This advantage allows $\pi_0$ [pi_0], which is built on PaliGemma-3B, to achieve a higher success rate despite slower inference than SmolVLA [smolvla], which is built on a small VLM. SwiftVLA, in contrast, enhances spatiotemporal dynamics for small VLA models while preserving their speed advantage. Success rate and speed are measured on the NVIDIA Jetson Orin [nvidia_jetson_orin].
  • Figure 2: (a) 2D-only approaches feed only 2D features to the VLM [pi_0, openvla], which results in limited spatiotemporal awareness. (b) Direct fusion approaches combine spatial and 2D features within large VLMs [bhat20253d, lin2025evo3dvla]. (c) Decoupled designs introduce a dedicated spatial branch [geovla, pointvla], causing large parameter overhead. (d) SwiftVLA leverages a pretrained model [streamvggt] to extract 4D features and applies a feature reconstruction objective to align 4D and 2D representations. In addition, Fusion Tokens and a future prediction objective are introduced to strengthen cross-modal integration. The 4D inputs and auxiliary heads are removed at inference to maintain efficiency.
  • Figure 3: The SwiftVLA pipeline. We first extract 2D and 4D features from the input images. A lightweight VLM [smolvlm] processes the 2D and 4D features together with Fusion Tokens to achieve cross-modal integration. The outputs of the Fusion Tokens are supervised by the robot end-effector's future trajectory. During training, we randomly mask either the 2D or the 4D features and require the action expert to reconstruct the masked features while learning to generate actions. The attention mask shown corresponds to random masking of the 4D features: the 4D features are excluded from the VLM attention, and the model must reconstruct them from the remaining inputs.
  • Figure 4: The 4D feature extraction process. At each step, we process the multi-view observations sequentially and load contextual information from the cache for temporal attention. The generated 4D features are written to the cache and passed to the VLM.
  • Figure 5: Comparison of SmolVLA and SwiftVLA under identical initial poses. During execution, SmolVLA fails to grasp accurately: the end-effector misses the target and collides with the object, shifting it and posing safety risks. In contrast, SwiftVLA completes the grasp with accurate positioning and stable control.
  • ...and 4 more figures
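
The cached extraction loop described in the Figure 4 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: `TemporalCache`, `extract_step`, and `toy_encode` are hypothetical names, the cache is assumed to be a bounded first-in-first-out buffer, and the cache is assumed to be updated after each view rather than once per step. The real encoder is a pretrained 4D visual geometry transformer; a toy stand-in is used here.

```python
from collections import deque
from typing import Any, Callable, List, Sequence


class TemporalCache:
    """Bounded FIFO store of past features (assumed cache policy)."""

    def __init__(self, max_entries: int) -> None:
        # deque with maxlen silently evicts the oldest entry when full.
        self._buf: deque = deque(maxlen=max_entries)

    def context(self) -> List[Any]:
        """Return cached features, oldest first, for temporal attention."""
        return list(self._buf)

    def update(self, feature: Any) -> None:
        """Append the newest feature to the cache."""
        self._buf.append(feature)


def extract_step(views: Sequence[Any],
                 cache: TemporalCache,
                 encode: Callable[[Any, List[Any]], Any]) -> List[Any]:
    """Process one time step: encode each view against the cached context."""
    step_features = []
    for view in views:                      # views are handled sequentially
        feature = encode(view, cache.context())
        cache.update(feature)               # make it visible to later views
        step_features.append(feature)       # these go on to the VLM
    return step_features


def toy_encode(view: str, context: List[Any]) -> tuple:
    """Hypothetical stand-in encoder: records how much context was seen."""
    return (view, len(context))


cache = TemporalCache(max_entries=3)
step1 = extract_step(["front", "wrist"], cache, toy_encode)
step2 = extract_step(["front", "wrist"], cache, toy_encode)
step3 = extract_step(["front", "wrist"], cache, toy_encode)
```

Bounding the cache keeps the per-step cost fixed regardless of episode length, which is one plausible reason to prefer a cache over re-encoding the full history at every step.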