Spatial Causal Prediction in Video

Yanguang Zhao, Jie Yang, Shengqiong Wu, Shutong Hu, Hongbo Qiu, Yu Wang, Guijia Zhang, Tan Kai Ze, Hao Fei, Chia-Wen Lin, Mong-Li Lee, Wynne Hsu

TL;DR

This work introduces Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes, and constructs SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation.

Abstract

Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence. The project page is https://guangstrip.github.io/SCP-Bench.

Paper Structure (65 sections, 21 figures, 17 tables)

Figures (21)

  • Figure 1: Existing benchmarks primarily assess known static or known dynamic reasoning based on fully observable scenes. A more challenging dynamic–unseen setting is to evaluate models' ability to predict spatial outcomes from partial observations.
  • Figure 2: Overview of SCP-Bench. Left: Representative examples illustrating the eight task categories. Right: Data distribution across scene categories and task types. The benchmark comprises 2,500 QA pairs over 1,181 video clips.
  • Figure 3: Overview of the SCP-Bench construction pipeline. The process comprises five stages: (1) collection of diverse video sources, (2) clip selection with spatially dynamic segments, (3) generation of candidate QA pairs, (4) QA filtering and cutpoint identification, and (5) dataset validation and refinement.
  • Figure 4: Results across perspectives, view directions, and scenes.
  • Figure 5: Temporal extrapolation horizon analysis. Samples are grouped by the time gap between the cutpoint and future event: short (0–2s), mid (2–5s), and long (> 5s).
  • ...and 16 more figures
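
The horizon grouping described in the Figure 5 caption can be illustrated with a minimal Python sketch. It is not the authors' evaluation code: the function names, the (gap_seconds, is_correct) sample format, and the choice to count gaps of exactly 2s as short and exactly 5s as mid are all assumptions made here for illustration, since the caption does not say how boundary values are handled.

```python
from collections import defaultdict


def horizon_bucket(gap_seconds: float) -> str:
    """Map the cutpoint-to-event time gap (in seconds) to a horizon group."""
    if gap_seconds < 0:
        raise ValueError("time gap must be non-negative")
    if gap_seconds <= 2.0:  # assumption: a gap of exactly 2s counts as short
        return "short"
    if gap_seconds <= 5.0:  # assumption: a gap of exactly 5s counts as mid
        return "mid"
    return "long"


def accuracy_by_horizon(samples):
    """Compute per-group accuracy.

    samples: iterable of (gap_seconds, is_correct) pairs (hypothetical format).
    Returns a dict mapping each group that occurs to its accuracy.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for gap, correct in samples:
        bucket = horizon_bucket(gap)
        totals[bucket] += 1
        hits[bucket] += int(correct)
    return {bucket: hits[bucket] / totals[bucket] for bucket in totals}
```

Calling accuracy_by_horizon on a list of per-question results gives one accuracy value per horizon group, which is the kind of breakdown the caption describes.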