V-STaR: Benchmarking Video-LLMs on Video Spatio-Temporal Reasoning

Zixu Cheng, Jian Hu, Ziquan Liu, Chenyang Si, Wei Li, Shaogang Gong

TL;DR

This work addresses a gap in evaluation: whether Video-LLMs can perform integrated spatio-temporal reasoning in videos. It introduces V-STaR, a benchmark built around Reverse Spatio-Temporal Reasoning (RSTR) and coarse-to-fine Chain-of-Thought (CoT) questions, scored with a Logarithmic Geometric Mean (LGM) metric that captures reasoning quality across what, when, and where. A semi-automated GPT-4-driven pipeline builds a dataset with explicit reasoning traces and two RSTR chains per sample that probe reasoning order. Evaluations of 14 Video-LLMs show that while some models excel at identifying objects, many struggle to ground them in time and space or to maintain coherent reasoning across long sequences, which exposes substantial gaps in spatio-temporal understanding and indicates where future models need to improve.

Abstract

Humans process video through sequential spatio-temporal reasoning: we first identify the relevant frames ("when"), then analyse the spatial relationships ("where") between key objects, and finally leverage these relationships to draw inferences ("what"). Can Video Large Language Models (Video-LLMs) also "reason through a sequential spatio-temporal logic" in videos? Existing Video-LLM benchmarks primarily focus on assessing object presence and neglect relational reasoning. Consequently, it is difficult to measure whether a model truly comprehends object interactions (actions/events) in videos or merely relies on pre-trained "memory" of co-occurrences as biases when generating answers. In this work, we introduce a Video Spatio-Temporal Reasoning (V-STaR) benchmark to address these shortcomings. The key idea is to decompose video understanding into a Reverse Spatio-Temporal Reasoning (RSTR) task that simultaneously evaluates what objects are present, when events occur, and where they are located, while capturing the underlying Chain-of-Thought (CoT) logic. To support this evaluation, we construct a dataset to elicit the spatio-temporal reasoning process of Video-LLMs. It contains coarse-to-fine CoT questions generated by a semi-automated GPT-4-powered pipeline, embedding explicit reasoning chains to mimic human cognition. Experiments with 14 Video-LLMs on V-STaR reveal significant gaps between current Video-LLMs and the need for robust and consistent spatio-temporal reasoning.

Paper Structure

This paper contains 10 sections, 4 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustration of the semi-automated data construction pipeline of V-STaR. GPT-4 generates a spatio-temporal reasoning CoT chain to answer VQA questions, along with a set of RSTR questions. The RSTR questions are independent temporal or spatial grounding challenges, decomposed from the CoT reasoning chain, designed to evaluate the model’s spatio-temporal reasoning capabilities.
  • Figure 2: Dataset statistics of video domain and length, and visualization of objects in video.
  • Figure 3: An example illustrating the construction of CoT questions. Each sample contains a thinking chain and two RSTR question chains.
  • Figure 4: Performance by video domain.
  • Figure 5: An example showcasing the performance of five models.
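
As a reading aid for the Figure 3 caption, the sketch below shows one way a sample with a thinking chain and two RSTR question chains could be represented. Every class, field, and function name here is hypothetical, the example questions are made up, and the two chain orderings (what-when-where and what-where-when) are an assumption based on the "what, when, and where" wording above; none of this is taken from the V-STaR release.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed step labels, borrowed from the "what", "when", "where" wording of the abstract.
STEPS = ("what", "when", "where")

# Assumed orderings for the two RSTR chains; the text only states that there are two.
CHAIN_ORDERS = (("what", "when", "where"), ("what", "where", "when"))


@dataclass
class RSTRQuestion:
    """One question in an RSTR chain (hypothetical structure)."""
    step: str   # one of STEPS
    text: str   # the question shown to the model

    def __post_init__(self) -> None:
        if self.step not in STEPS:
            raise ValueError(f"unknown step: {self.step!r}")


@dataclass
class Sample:
    """A benchmark sample: one thinking chain plus two RSTR question chains."""
    video_id: str
    thinking_chain: List[str]
    rstr_chains: List[List[RSTRQuestion]]

    def chain_orders(self) -> List[Tuple[str, ...]]:
        # Report the step order of each chain, e.g. ("what", "when", "where").
        return [tuple(q.step for q in chain) for chain in self.rstr_chains]


def build_sample(video_id: str, thinking_chain: List[str],
                 questions: Dict[str, str]) -> Sample:
    """Arrange the same three questions into both assumed chain orders."""
    chains = [
        [RSTRQuestion(step, questions[step]) for step in order]
        for order in CHAIN_ORDERS
    ]
    return Sample(video_id, thinking_chain, chains)


# Made-up example content, for illustration only.
example = build_sample(
    video_id="demo-001",
    thinking_chain=[
        "Find the frames where the person picks something up.",
        "Locate the object in those frames.",
        "Name the object.",
    ],
    questions={
        "what": "What is the person holding?",
        "when": "When does the person pick it up?",
        "where": "Where is the object in the frame?",
    },
)

for order in example.chain_orders():
    print(" -> ".join(order))
```

Keeping both chains on the same sample makes it straightforward to compare a model's answers to identical questions asked in a different order, which is the kind of order-sensitivity check the TL;DR describes.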