TUMTraffic-VideoQA: A Benchmark for Unified Spatio-Temporal Video Understanding in Traffic Scenes

Xingcheng Zhou, Konstantinos Larintzakis, Hao Guo, Walter Zimmer, Mingyu Liu, Hu Cao, Jiajie Zhang, Venkatnarayanan Lakshminarasimhan, Leah Strand, Alois C. Knoll

TL;DR

TUMTraffic-VideoQA introduces a unified benchmark for spatio-temporal video understanding in roadside traffic scenes, addressing a gap in multi-task, third-person traffic data. It provides a large-scale dataset with three tasks (Multi-Choice QA, Spatio-Temporal Grounding, and Referred Object Captioning) under a common evaluation framework, and adopts a tuple-based object representation $(c, f_n, x, y)$ to enable cross-frame grounding. The paper also presents TUMTraffic-Qwen, a unified baseline that combines a SigLIP visual encoder, token-sampling strategies, and an instruction-tuned LLM (Qwen-2) to handle long video inputs. The experiments show that the dataset is challenging and that zero-shot VLMs remain limited on fine-grained traffic reasoning; they also show that multi-resolution token sampling improves some tasks while introducing trade-offs in others. The dataset and benchmark are publicly available as a platform for advancing traffic video understanding and developing next-generation traffic foundation models.

Abstract

We present TUMTraffic-VideoQA, a novel dataset and benchmark designed for spatio-temporal video understanding in complex roadside traffic scenarios. The dataset comprises 1,000 videos, featuring 85,000 multiple-choice QA pairs, 2,300 object captioning annotations, and 5,700 object grounding annotations, encompassing diverse real-world conditions such as adverse weather and traffic anomalies. By incorporating tuple-based spatio-temporal object expressions, TUMTraffic-VideoQA unifies three essential tasks (multiple-choice video question answering, referred object captioning, and spatio-temporal object grounding) within a cohesive evaluation framework. We further introduce the TUMTraffic-Qwen baseline model, enhanced with visual token sampling strategies, providing valuable insights into the challenges of fine-grained spatio-temporal reasoning. Extensive experiments demonstrate the dataset's complexity, highlight the limitations of existing models, and position TUMTraffic-VideoQA as a robust foundation for advancing research in intelligent transportation systems. The dataset and benchmark are publicly available to facilitate further exploration.

Paper Structure

This paper contains 30 sections, 7 equations, 26 figures, 6 tables.

Figures (26)

  • Figure 1: TUMTraffic-VideoQA introduces a comprehensive benchmark for video-level traffic scene understanding. Our baseline model, TUMTraffic-Qwen, is capable of solving multiple tasks, including video QA, spatio-temporal grounding, and referred object captioning, within a unified model. In our approach, the spatio-temporal location of objects is represented as tuples $(c, f_n, x, y)$, where $c$ serves as a unique object identifier, $f_n$ denotes the normalized frame timestamp, and $(x, y)$ denotes the center of the object in the image, normalized with respect to the image dimensions.
  • Figure 2: Different methods for describing objects in images and videos using language expressions. We adopt a tuple-based spatio-temporal object representation for the unique object reference, as shown in (d).
  • Figure 3: The workflow of the semi-automatic annotation pipeline for TUMTraffic-VideoQA generation, which integrates an external database and leverages various off-the-shelf tools and LLMs, with human quality checks ensuring accuracy.
  • Figure 4: Statistical distributions of the dataset, including word counts in questions and answers, distribution of question types, and temporal window lengths for object grounding.
  • Figure 5: Overview of the TUMTraffic-Qwen baseline model. Yellow and orange colors represent the combination of multi-resolution visual tokens from different visual strategies, while blue indicates textual tokens.
  • ...and 21 more figures
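
The tuple-based object representation described in the Figure 1 caption can be illustrated with a minimal Python sketch. This is not the authors' code: the names `ObjectRef`, `to_object_ref`, and `format_ref` are hypothetical, the integer type of the identifier `c` is assumed, and both the frame normalization (frame index divided by the last frame index) and the two-decimal text format are assumptions, since the summary only states that the timestamp and the center coordinates are normalized.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectRef:
    """Hypothetical container for the tuple (c, f_n, x, y)."""
    c: int      # unique object identifier (integer type is an assumption)
    fn: float   # normalized frame timestamp, assumed to lie in [0, 1]
    x: float    # object center x, normalized by image width
    y: float    # object center y, normalized by image height


def to_object_ref(obj_id, frame_idx, num_frames, cx_px, cy_px, width, height):
    """Build a normalized reference from pixel and frame coordinates.

    Assumption: the timestamp is normalized as frame_idx / (num_frames - 1),
    so the first frame maps to 0.0 and the last frame maps to 1.0.
    """
    if num_frames < 1 or not 0 <= frame_idx < num_frames:
        raise ValueError("frame_idx must satisfy 0 <= frame_idx < num_frames")
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # A single-frame clip has no temporal extent; map it to 0.0.
    fn = frame_idx / (num_frames - 1) if num_frames > 1 else 0.0
    return ObjectRef(c=obj_id, fn=fn, x=cx_px / width, y=cy_px / height)


def format_ref(ref):
    """Render the tuple as text (the two-decimal format is an assumption)."""
    return f"({ref.c}, {ref.fn:.2f}, {ref.x:.2f}, {ref.y:.2f})"


# Object 3, seen in frame 50 of a 101-frame clip, centered at pixel
# (960, 270) in a 1920x1080 image.
ref = to_object_ref(3, 50, 101, 960, 270, 1920, 1080)
print(format_ref(ref))  # prints "(3, 0.50, 0.50, 0.25)"
```

Because every field is normalized, the same reference stays valid regardless of video resolution or frame count, which is what lets one textual form serve both grounding outputs and captioning inputs.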