V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning
Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, Shaogang Gong
TL;DR
This work addresses a gap in evaluation: whether Video-LLMs can perform integrated spatio-temporal reasoning over videos. It introduces V-STaR, a benchmark built around Reverse Spatio-Temporal Reasoning (RSTR) and coarse-to-fine Chain-of-Thought (CoT) questions, together with a Logarithmic Geometric Mean (LGM) scoring metric that captures reasoning quality across what, when, and where. A semi-automatic GPT-4-driven pipeline produces a dataset with explicit reasoning traces and two RSTR chains that probe the order of reasoning. Evaluations across 14 Video-LLMs show that while some models excel at identifying objects, many struggle to ground their answers in time and space or to maintain coherent reasoning across long sequences, which exposes substantial gaps and points to directions for improving spatio-temporal understanding.
Abstract
Humans reason about video through a sequential spatio-temporal logic: we first identify the relevant frames ("when"), then analyse the spatial relationships ("where") between key objects, and finally leverage these relationships to draw inferences ("what"). However, can Video Large Language Models (Video-LLMs) also "reason through a sequential spatio-temporal logic" in videos? Existing Video-LLM benchmarks primarily focus on assessing object presence, neglecting relational reasoning. Consequently, it is difficult to measure whether a model truly comprehends object interactions (actions/events) in videos or merely relies on pre-trained "memory" of co-occurrences as a bias when generating answers. In this work, we introduce a Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. The key idea is to decompose video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task that simultaneously evaluates what objects are present, when events occur, and where they are located while capturing the underlying Chain-of-Thought (CoT) logic. To support this evaluation, we construct a dataset to elicit the spatio-temporal reasoning process of Video-LLMs. It contains coarse-to-fine CoT questions generated by a semi-automated GPT-4-powered pipeline, embedding explicit reasoning chains to mimic human cognition. Experiments with 14 Video-LLMs on V-STaR reveal significant gaps between current Video-LLMs and the requirements of robust and consistent spatio-temporal reasoning.
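To make the what/when/where decomposition concrete, the following minimal sketch scores one prediction against one annotation along each axis. It is an illustration under stated assumptions, not the benchmark's evaluation code: the `Grounding` record, the `score` function, exact-match scoring for "what", and the use of a single bounding box per sample are all hypothetical simplifications. Temporal IoU and box IoU are standard grounding measures, but this section does not specify which per-axis metrics V-STaR uses or how LGM combines them.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class Grounding:
    """Hypothetical record for one question: answer, time span, and one box."""
    answer: str                              # "what"
    span: Tuple[float, float]                # "when": (start_s, end_s)
    box: Tuple[float, float, float, float]   # "where": (x1, y1, x2, y2)


def temporal_iou(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def box_iou(a, b) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def score(pred: Grounding, gold: Grounding) -> Dict[str, float]:
    """Per-axis scores in [0, 1]; exact match for 'what' is an assumption."""
    same = pred.answer.strip().lower() == gold.answer.strip().lower()
    return {
        "what": 1.0 if same else 0.0,
        "when": temporal_iou(pred.span, gold.span),
        "where": box_iou(pred.box, gold.box),
    }


gold = Grounding("dog", (2.0, 6.0), (0.0, 0.0, 10.0, 10.0))
pred = Grounding("Dog", (4.0, 8.0), (5.0, 0.0, 15.0, 10.0))
result = score(pred, gold)
```

In this example the model names the right object but overlaps the annotated time span and box only partially, which is the kind of gap between recognition and grounding that the benchmark is designed to surface.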
