ST-VLM: Kinematic Instruction Tuning for Spatio-Temporal Reasoning in Vision-Language Models

Dohwan Ko, Sihyeon Kim, Yumin Suh, Vijay Kumar B. G, Minseo Yoon, Manmohan Chandraker, Hyunwoo J. Kim

TL;DR

The paper addresses the gap in vision-language models' ability to perform spatio-temporal reasoning on dynamic scenes by introducing STKit and STKit-Bench, a dataset and benchmark for kinematic instruction tuning. It proposes a scalable pipeline that combines 3D-grounded supervision with 4D reconstruction-based pseudo-labeling to generate kinematic QA data, which is used to fine-tune ST-VLM (a LLaVA-OneVision-based model) to reason about traveled distance, speed, and movement direction. ST-VLM achieves substantial gains over state-of-the-art baselines, generalizes across diverse domains (autonomous driving and sports), and exhibits emergent multi-step reasoning by integrating learned spatio-temporal priors with existing LLM knowledge. The results suggest practical impact for real-world scenarios that require kinematic understanding of moving objects, such as autonomous driving, sports analytics, and embodied AI.

Abstract

Spatio-temporal reasoning is essential in understanding real-world environments in various fields, e.g., autonomous driving and sports analytics. Recent advances have improved the spatial reasoning ability of Vision-Language Models (VLMs) by introducing large-scale data, but these models still struggle to analyze kinematic elements like traveled distance and speed of moving objects. To bridge this gap, we construct a spatio-temporal reasoning dataset and benchmark involving kinematic instruction tuning, referred to as STKit and STKit-Bench. They consist of real-world videos with 3D annotations, detailing object motion dynamics: traveled distance, speed, movement direction, inter-object distance comparisons, and relative movement direction. To further scale such data construction to videos without 3D labels, we propose an automatic pipeline to generate pseudo-labels using 4D reconstruction at real-world scale. With our kinematic instruction tuning data for spatio-temporal reasoning, we present ST-VLM, a VLM enhanced for spatio-temporal reasoning, which exhibits outstanding performance on STKit-Bench. Furthermore, we show that ST-VLM generalizes robustly across diverse domains and tasks, outperforming baselines on other spatio-temporal benchmarks (e.g., ActivityNet, TVQA+). Finally, by integrating learned spatio-temporal reasoning with existing abilities, ST-VLM enables complex multi-step reasoning. Project page: https://ikodoh.github.io/ST-VLM.

Paper Structure

This paper contains 17 sections, 1 equation, 6 figures, 7 tables.

Figures (6)

  • Figure 1: Task examples from the proposed STKit-Bench along with predictions from ST-VLM.
  • Figure 2: Movement directions as clockwise directions.
  • Figure 3: Pseudo-label generation pipeline. For the geometric reconstruction branch, a canonicalized 4D scene is reconstructed using MonST3R [zhang2024monst3r] and Metric3Dv2 [hu2024metric3d]. For the semantic understanding branch, the object bounding boxes, segmentation masks, and trajectories are extracted using Grounded-SAM2 [ren2024grounded]. Finally, by integrating each branch, 2D object masks are lifted to 3D and trajectories are computed by tracking 3D barycenters in the 4D scene.
  • Figure 4: Statistics of STKit-Bench. We balance the number of samples for each label to prevent biased results. Red and green bars indicate the number of samples before/after balancing.
  • Figure 5: Comparison of LLaVA-OneVision and ST-VLM for spatio-temporal understanding. mIoU is multiplied by 100.
  • ...and 1 more figures
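
The captions for Figures 2 and 3 mention tracking 3D barycenters and expressing movement direction as a clock direction. A minimal sketch of how such kinematic labels could be derived from per-frame 3D points is given below. It is not the authors' implementation: the function names, the axis convention (+z forward is 12 o'clock, +x right is 3 o'clock, y ignored), and the use of straight-line segments between frames are all assumptions made for illustration.

```python
import math
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]


def barycenter(points: Sequence[Point3D]) -> Point3D:
    """Mean of the 3D points lifted from one object mask in one frame."""
    n = len(points)
    if n == 0:
        raise ValueError("need at least one 3D point")
    return (
        sum(p[0] for p in points) / n,
        sum(p[1] for p in points) / n,
        sum(p[2] for p in points) / n,
    )


def traveled_distance(trajectory: Sequence[Point3D]) -> float:
    """Sum of straight-line distances between consecutive barycenters."""
    return sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))


def average_speed(trajectory: Sequence[Point3D], fps: float) -> float:
    """Traveled distance divided by elapsed time (distance units per second)."""
    duration = (len(trajectory) - 1) / fps
    if duration <= 0:
        raise ValueError("need at least two frames and a positive fps")
    return traveled_distance(trajectory) / duration


def clock_direction(start: Point3D, end: Point3D) -> int:
    """Map the net ground-plane displacement to a clock hour in 1..12.

    Assumed convention (hypothetical): +z is 12 o'clock, +x is 3 o'clock.
    """
    dx = end[0] - start[0]
    dz = end[2] - start[2]
    if dx == 0 and dz == 0:
        raise ValueError("object did not move on the ground plane")
    # Angle measured clockwise from the forward (+z) axis, in [0, 360).
    angle = math.degrees(math.atan2(dx, dz)) % 360.0
    hour = round(angle / 30.0) % 12  # each clock hour spans 30 degrees
    return 12 if hour == 0 else hour


# Toy example: three frames of lifted mask points, in metres, sampled at 2 fps.
frames = [
    [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
    [(1.0, 0.0, 4.0), (7.0, 0.0, 4.0)],
    [(4.0, 0.0, 9.0)],
]
trajectory = [barycenter(points) for points in frames]
distance_m = traveled_distance(trajectory)
speed_mps = average_speed(trajectory, fps=2.0)
direction = clock_direction(trajectory[0], trajectory[-1])
print(distance_m, speed_mps, direction)
```

In this toy example the barycenters are (1, 0, 0), (4, 0, 4), and (4, 0, 9), so the two segments are each 5 m long. Summing segment lengths rather than taking the start-to-end displacement keeps traveled distance meaningful for curved paths, while the clock direction here uses only the net displacement.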