VLM4D: Towards Spatiotemporal Awareness in Vision Language Models

Shijie Zhou, Alexander Vilesov, Xuehai He, Ziyu Wan, Shuwang Zhang, Aditya Nagachandra, Di Chang, Dongdong Chen, Xin Eric Wang, Achuta Kadambi

TL;DR

This work introduces VLM4D, the first benchmark explicitly designed to evaluate spatiotemporal (4D) reasoning in Vision-Language Models, using a mix of real and synthetic videos with QA pairs that stress translational and rotational motion, perspective, and temporal continuity. It demonstrates that contemporary VLMs, including state-of-the-art closed- and open-source models, exhibit substantial gaps relative to human performance and across data types, exposing limitations in 4D grounding and labeling accuracy. The authors analyze root causes (limited 4D cognition and sparse, imprecise spatiotemporal annotations in existing datasets) and propose two promising directions to enhance dynamic understanding: spatiotemporal supervised fine-tuning and 4D feature-field reconstruction. The work provides a foundation for future research toward more robust spatiotemporal grounding in vision-language systems, with potential impact on robotics, interactive AI, and other dynamic-intelligence applications.

Abstract

Vision language models (VLMs) have shown remarkable capabilities in integrating linguistic and visual reasoning but remain fundamentally limited in understanding dynamic spatiotemporal interactions. Humans effortlessly track and reason about object movements, rotations, and perspective shifts, abilities that are essential for robust dynamic real-world understanding yet notably lacking in current VLMs. In this paper, we introduce VLM4D, the first benchmark specifically designed to evaluate the spatiotemporal reasoning capabilities of VLMs. Our benchmark comprises diverse real-world and synthetic videos accompanied by carefully curated question-answer pairs emphasizing translational and rotational motions, perspective awareness, and motion continuity. Through comprehensive evaluations of state-of-the-art open- and closed-source VLMs, we identify significant performance gaps compared to human baselines, highlighting fundamental deficiencies in existing models. Extensive analysis reveals that VLMs struggle particularly with integrating multiple visual cues and maintaining temporal coherence. We further explore promising directions, such as leveraging 4D feature field reconstruction and targeted spatiotemporal supervised fine-tuning, demonstrating their effectiveness in enhancing spatiotemporal comprehension. Our work aims to encourage deeper exploration into improving VLMs' spatial and temporal grounding, paving the way towards more capable and reliable visual intelligence for dynamic environments.

Paper Structure

This paper contains 29 sections, 19 figures, 8 tables.

Figures (19)

  • Figure 1: Distribution of Dataset Sources and Annotations. Overview of the dataset composition, illustrating the proportions of real third-person (exo-centric) videos (DAVIS [davis2017], YouTube-VOS [xu2018youtube]), real first-person (ego-centric) videos (Ego4D [grauman2022ego4d]), and synthetic videos (Cosmos [agarwal2025cosmos]). The real video data is further categorized by annotation types, including translational, rotational, action, counting, and false positive queries (targeting nonexistent events to assess critical reasoning).
  • Figure 2: Dataset Generation and Annotation Pipeline. Our dataset was constructed by collecting real videos and generating synthetic data, followed by human-in-the-loop quality reviews to address ambiguous videos and annotations. After temporal alignment and quality assurance, human-annotated questions and ground-truth answers were created, complemented by multiple-choice (MC) answers generated by large language models (LLMs). The final dataset includes real and synthetic video data with comprehensive VLM scoring metrics.
  • Figure 3: Qualitative Examples of Dataset Annotations. (Top) A third-person (exo-centric) video with translational annotations ("camel turning left from its perspective"). (Middle) A first-person (ego-centric) video with a rotational question ("clockwise rotation of ladle"). (Bottom) A synthetic scene with motion recognition ("robotic dog moving left").
  • Figure 4: Comparison of accuracy across types of spatiotemporal questions. Model accuracy is shown only for the six top-performing VLMs.
  • Figure 5: Comparison of CoT and DO Accuracy Across Models. Accuracy comparison between Chain-of-Thought (CoT) and Direct Output (DO) prompting across VLMs.
  • ...and 14 more figures
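
The per-question-type accuracy comparison described in the Figure 4 caption can be illustrated with a minimal sketch. It assumes each evaluated item is a record holding a question category, the model's chosen multiple-choice letter, and the ground-truth letter. The field names (`category`, `prediction`, `answer`), the function name, and the toy records are hypothetical and are not taken from the paper or its released code.

```python
from collections import defaultdict


def accuracy_by_category(records):
    """Return {category: fraction of correct multiple-choice answers}.

    `records` is an iterable of dicts with the hypothetical keys
    'category', 'prediction', and 'answer' (choice letters such as "A").
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        category = record["category"]
        total[category] += 1
        # Compare choice letters case-insensitively, ignoring stray whitespace.
        if record["prediction"].strip().upper() == record["answer"].strip().upper():
            correct[category] += 1
    return {category: correct[category] / total[category] for category in total}


# Toy records for illustration only; these are not results from the benchmark.
records = [
    {"category": "translational", "prediction": "A", "answer": "A"},
    {"category": "translational", "prediction": "B", "answer": "A"},
    {"category": "rotational", "prediction": "c", "answer": "C"},
    {"category": "false_positive", "prediction": "D", "answer": "B"},
]
scores = accuracy_by_category(records)
```

Running the same function on outputs from two prompting styles would give the kind of side-by-side comparison that the Figure 5 caption describes for Chain-of-Thought versus Direct Output.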