Video-STR: Reinforcing MLLMs in Video Spatio-Temporal Reasoning with Relation Graph

Wentao Wang, Heqing Zou, Tianze Luo, Rui Huang, Yutian Zhao, Zhuochen Wang, Hansheng Zhang, Chengwei Qin, Yan Wang, Lin Zhao, Huaijian Zhang

TL;DR

Video-STR tackles precise spatio-temporal reasoning in videos by combining a graph-based representation of inter-object topology with Reinforcement Learning with Verifiable Reward (RLVR). It introduces the STV-205k dataset, with 205k QA pairs, to train and evaluate spatial and temporal reasoning in video contexts. The method extends Group Relative Policy Optimization (GRPO) with a graph reasoning module and reports state-of-the-art performance across multiple benchmarks, including a 13% improvement over the base model on STI-Bench. The results support RLVR combined with graph-based reasoning as a promising direction for improving the video spatio-temporal capabilities of Multimodal Large Language Models.

Abstract

Recent Multimodal Large Language Models (MLLMs) have demonstrated strong semantic understanding capabilities but struggle to perform precise spatio-temporal understanding. Existing spatio-temporal methods primarily focus on the video itself, while overlooking the physical information within the video, such as multi-object layouts and motion. Such limitations restrict the use of MLLMs in downstream applications that demand high precision, including embodied intelligence and VR. To address this issue, we present Video-STR, a novel graph-based reinforcement method for precise Video Spatio-Temporal Reasoning. Building upon the capacity of Reinforcement Learning with Verifiable Reward (RLVR) to improve model abilities, we introduce a reasoning mechanism that uses a graph-based Group Relative Policy Optimization (GRPO) method to guide the model in inferring the underlying spatio-temporal topology of scenarios during the thinking process. To resolve the lack of spatio-temporal training data, we construct the STV-205k dataset with 205k question-answering pairs, covering dynamic multi-object scenes in both indoor and outdoor environments, to support model training. Experiments show that Video-STR achieves state-of-the-art results on various benchmarks, outperforming the base model by 13% on STI-Bench and demonstrating the effectiveness of our approach and dataset. Code, model, and data will be released.
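
To make the RLVR and GRPO terminology above concrete, the following is a minimal sketch, not the paper's implementation. It assumes that a verifiable reward can be computed by comparing relation triples extracted from a sampled response against a reference graph; the `relation_reward` function, the Jaccard scoring, and the example triples are all hypothetical. The group-relative advantage follows the usual GRPO idea of standardizing each reward against the other responses sampled for the same prompt; using the population standard deviation here is an assumption.

```python
from statistics import mean, pstdev


def relation_reward(predicted, reference):
    """Hypothetical verifiable reward: Jaccard overlap of relation triples.

    This scoring rule is an illustration only, not the reward used in the paper.
    """
    predicted, reference = set(predicted), set(reference)
    if not predicted and not reference:
        return 1.0
    return len(predicted & reference) / len(predicted | reference)


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical reference graph as (subject, relation, object) triples.
reference = {("cup", "left_of", "laptop"), ("chair", "behind", "table")}

# Three hypothetical responses sampled for the same question.
group = [
    {("cup", "left_of", "laptop"), ("chair", "behind", "table")},
    {("cup", "left_of", "laptop")},
    {("cup", "right_of", "laptop")},
]

rewards = [relation_reward(g, reference) for g in group]
advantages = group_relative_advantages(rewards)
```

In this sketch, responses that recover more of the reference relations than their group average receive a positive advantage and the rest a negative one, so no separate value model is needed to rank them.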

Paper Structure

This paper contains 24 sections, 18 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: Overview of Video-STR and previous methods. (a) Previous methods localized using images and 2D cognitive maps with spatial cues. (b) Our method, Video-STR, uses graph-based reasoning and RL to infer spatio-temporal relations and object states in videos.
  • Figure 2: The overview of the constructed dataset.
  • Figure 3: Data statistics of our constructed STV-205k dataset.
  • Figure 4: (a) Performance on sub-tasks. (b) Performance on static and dynamic tasks. (c) Performance in different scenarios.
  • Figure 5: SFT using different datasets.
  • ...and 5 more figures
