4D-VLA: Spatiotemporal Vision-Language-Action Pretraining with Cross-Scene Calibration

Jiahui Zhang, Yurui Chen, Yueming Xu, Ze Huang, Yanpeng Zhou, Yu-Jie Yuan, Xinyue Cai, Guowei Huang, Xingyue Quan, Hang Xu, Li Zhang

TL;DR

We address pretraining inefficiency on diverse robotic data caused by coordinate and state chaos from limited inputs. The proposed 4D-VLA integrates 4D spatiotemporal information by converting RGB-D sequences into 3D-aware spatial tokens, fusing with language, and decoding actions with an efficient memory bank sampling strategy and temporal encodings. This yields stronger spatiotemporal reasoning and significantly improves performance over OpenVLA across LIBERO tasks, MV-Bench, and real-world experiments, while demonstrating robust generalization to novel viewpoints. The work also introduces MV-Bench to assess spatial understanding and viewpoint generalization, underscoring the practical impact of robust 4D perception for embodied AI.

Abstract

Leveraging diverse robotic data for pretraining remains a critical challenge. Existing methods typically model the dataset's action distribution using simple observations as inputs. However, these inputs are often incomplete, resulting in a dispersed conditional action distribution, an issue we refer to as coordinate system chaos and state chaos. This inconsistency significantly hampers pretraining efficiency. To address this, we propose 4D-VLA, a novel approach that effectively integrates 4D information into the input to mitigate these sources of chaos. Our model introduces depth and temporal information into visual features with sequential RGB-D inputs, aligning the coordinate systems of the robot and the scene. This alignment endows the model with strong spatiotemporal reasoning capabilities while minimizing training overhead. Additionally, we introduce memory bank sampling, a frame sampling strategy designed to extract informative frames from historical images, further improving effectiveness and efficiency. Experimental results demonstrate that our pretraining method and architectural components substantially enhance model performance. In both simulated and real-world experiments, our model achieves a significant increase in success rate over OpenVLA. To further assess spatial perception and generalization to novel views, we introduce MV-Bench, a multi-view simulation benchmark. Our model consistently outperforms existing methods, demonstrating stronger spatial understanding and adaptability.
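As a rough illustration of what "aligning the coordinate systems of the robot and the scene" with RGB-D input can involve, the sketch below lifts a depth map to per-pixel 3D points expressed in a shared world (for example, robot-base) frame. It is not the authors' implementation: the pinhole intrinsics `K`, the camera-to-world matrix `cam_to_world`, and the function name `backproject_depth` are assumptions introduced here for illustration.

```python
import numpy as np

def backproject_depth(depth, K, cam_to_world):
    """Lift a depth map (H, W) to 3D points (H, W, 3) in the world frame.

    Assumptions (not from the paper): depth is metric distance along the
    camera z-axis, K is a 3x3 pinhole intrinsics matrix, and cam_to_world
    is a 4x4 homogeneous camera-to-world transform.
    """
    h, w = depth.shape
    # u holds column indices and v holds row indices, both shaped (H, W).
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    # Invert the pinhole projection to get camera-frame coordinates.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth, np.ones_like(depth)], axis=-1)  # (H, W, 4)
    # Apply the rigid transform to every pixel's homogeneous point.
    pts_world = pts_cam @ cam_to_world.T
    return pts_world[..., :3]

# Usage: a flat surface 2 m in front of a camera offset by (1, 2, 3).
K = np.array([[100.0, 0.0, 2.0], [0.0, 100.0, 1.0], [0.0, 0.0, 1.0]])
cam_to_world = np.eye(4)
cam_to_world[:3, 3] = [1.0, 2.0, 3.0]
points = backproject_depth(np.full((3, 5), 2.0), K, cam_to_world)
```

Because the resulting coordinates do not depend on where the camera sits, per-pixel points of this kind could serve as the 3D positional signal attached to visual tokens, which is one way to read the "3D-aware spatial tokens" mentioned in the TL;DR.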

Paper Structure

This paper contains 33 sections, 5 equations, 8 figures, 8 tables, 1 algorithm.

Figures (8)

  • Figure 1: Top: Our pretraining design philosophy highlights that prior methods often lack key cues in their input for accurate action inference. This leads to target action distributions $A_t(\cdot)$ exhibiting high variance or non-smoothness, which negatively impacts pretraining performance. A rough analysis shows that in the DROID dataset, 67% of the samples have the robot’s base occluded, causing coordinate system chaos. Bottom: We verify our method in both simulated and real-world robotic settings and report the performance for the OpenVLA baseline and our 4D-VLA approach.
  • Figure 2: Our 4D-VLA pipeline. Our memory bank sampling method selects informative frames from sequential RGB-D inputs. A vision encoder with 3D coordinate embeddings generates spatial-aware tokens, which are fused into a 4D spatiotemporal representation. Combined with text tokens, these are processed by the LLM to decode actions via an action head.
  • Figure 3: Our MV-Bench camera setting. We select 6 diverse viewpoints as training views and render images for all LIBERO-SPATIAL tasks. Novel inference views are placed near the training views. To avoid occlusion from the black box, test views in blocked areas are excluded.
  • Figure 4: Our real-world experiment settings. These settings aim to evaluate the model’s spatial generalization, robustness to distractors, precision in placement, and ability to follow instructions. Each row presents a 3-frame execution snapshot.
  • Figure 5: Our multi-view real-world experiment settings. These settings aim to evaluate the model’s out-of-distribution and novel-view generalization ability.
  • ...and 3 more figures
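
The Figure 2 caption says memory bank sampling "selects informative frames from sequential RGB-D inputs" but does not state the selection criterion. The sketch below shows one plausible reading, a greedy filter that skips frames too similar to the last kept one. The cosine-similarity test, the threshold value, and the name `memory_bank_sample` are assumptions made here, not the algorithm from the paper.

```python
import numpy as np

def memory_bank_sample(features, max_frames, sim_threshold=0.9):
    """Greedily pick frame indices from a history of per-frame features.

    Hypothetical sketch: features is a (T, D) array with T >= 1, ordered
    oldest first, and max_frames >= 1. A frame is kept only when its cosine
    similarity to the most recently kept frame falls below sim_threshold.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    feats = features / (norms + 1e-8)  # unit-normalize for cosine similarity
    kept = [0]
    for t in range(1, len(feats)):
        if float(feats[t] @ feats[kept[-1]]) < sim_threshold:
            kept.append(t)
    # Always include the current (latest) frame, even if it is redundant.
    last = len(feats) - 1
    if kept[-1] != last:
        kept.append(last)
    # Keep only the most recent max_frames selections.
    return kept[-max_frames:]

# Usage: frames 1 and 3 nearly repeat their predecessors and are skipped.
history = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
selected = memory_bank_sample(history, max_frames=5)
recent_two = memory_bank_sample(history, max_frames=2)
```

A filter of this kind bounds the number of frames passed to the model while spending that budget on moments where the scene actually changed, which is consistent with the efficiency motivation given in the abstract.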