Spatial Causal Prediction in Video

Yanguang Zhao; Jie Yang; Shengqiong Wu; Shutong Hu; Hongbo Qiu; Yu Wang; Guijia Zhang; Tan Kai Ze; Hao Fei; Chia-Wen Lin; Mong-Li Lee; Wynne Hsu

TL;DR

This work introduces Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes, and constructs SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation.

Abstract

Spatial reasoning, the ability to understand spatial relations, causality, and dynamic evolution, is central to human intelligence and essential for real-world applications such as autonomous driving and robotics. Existing studies, however, primarily assess models on visible spatio-temporal understanding, overlooking their ability to infer unseen past or future spatial states. In this work, we introduce Spatial Causal Prediction (SCP), a new task paradigm that challenges models to reason beyond observation and predict spatial causal outcomes. We further construct SCP-Bench, a benchmark comprising 2,500 QA pairs across 1,181 videos spanning diverse viewpoints, scenes, and causal directions, to support systematic evaluation. Through comprehensive experiments on 23 state-of-the-art models, we reveal substantial gaps between human and model performance, limited temporal extrapolation, and weak causal grounding. We further analyze key factors influencing performance and propose perception-enhancement and reasoning-guided strategies toward advancing spatial causal intelligence. The project page is https://guangstrip.github.io/SCP-Bench.

Paper Structure (65 sections, 21 figures, 17 tables)

Introduction
Related Work
Spatial-aware MLLMs.
Benchmarking Spatial Intelligence.
SCP-Bench
Overview.
Benchmark Design
Question Type.
Causal Direction.
Perspective Setting.
Scene Diversity.
Construction Process
Setups of Experiments and Analyses
How Well Do Current Models Perform?
Overall Evaluation Results.
...and 50 more sections

Figures (21)

Figure 1: Existing benchmarks primarily assess known static or known dynamic reasoning based on fully observable scenes. A more challenging dynamic–unseen setting is to evaluate models' ability to predict spatial outcomes from partial observations.
Figure 2: Overview of SCP-Bench. Left: Representative examples illustrating the eight task categories. Right: Data distribution across scene categories and task types. The benchmark comprises 2,500 QA pairs over 1,181 video clips.
Figure 3: Overview of the SCP-Bench construction pipeline. The process comprises five stages: (1) collection of diverse video sources, (2) clip selection with spatially dynamic segments, (3) generation of candidate QA pairs, (4) QA filtering and cutpoint identification, and (5) dataset validation and refinement.
Figure 4: Results across perspectives, view directions, and scenes.
Figure 5: Temporal extrapolation horizon analysis. Samples are grouped by the time gap between the cutpoint and future event: short (0–2s), mid (2–5s), and long (> 5s).
...and 16 more figures
