ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos

Jr-Jen Chen, Yu-Chien Liao, Hsi-Che Lin, Yu-Chu Yu, Yen-Chun Chen, Yu-Chiang Frank Wang

TL;DR

ReXTime proposes a scalable benchmark to evaluate reasoning across time in videos, addressing the gap where questions and answers fall in different segments. It introduces an automated LLM-assisted pipeline that yields 9,695 training samples in addition to 921 validation and 2,143 test samples, alongside a QA-IoU metric to quantify cross-time reasoning and grounding. Benchmark results show frontier multimodal LLMs lag behind human performance on both temporal reasoning VQA and moment localization, with GPT-4o achieving about 73.7% VQA accuracy and humans around 88%, indicating substantial room for improvement. The automatic data-generation approach reduces labeling cost by about 55% and enables effective fine-tuning, making ReXTime a practical driver for advancing cross-time temporal reasoning in video-language models.
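The grounding side of the evaluation compares a predicted time span against an annotated one. The exact definition of QA-IoU is not given in this summary, so the snippet below is only a minimal sketch of the standard temporal intersection-over-union that moment-localization metrics are usually built on. The function name `temporal_iou` and the `(start, end)` tuple layout in seconds are assumptions made for illustration, not the authors' code.

```python
from typing import Tuple

# Assumed representation: a span is (start_seconds, end_seconds).
Span = Tuple[float, float]


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection-over-union of two time spans, a value in [0, 1]."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    if pred_end < pred_start or gt_end < gt_start:
        raise ValueError("a span's end must not precede its start")

    # Overlap length is zero when the spans are disjoint or only touch.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection

    # Two zero-length spans have no union; treat that case as no overlap.
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    # Predicted span 0-10 s against an annotated span 5-15 s.
    print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
```

How such a score is combined with answer correctness to form QA-IoU is left to the paper itself; this sketch covers only the span-overlap part.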

Abstract

We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e., human-like understanding when the question and its corresponding answer occur in different video segments. This form of reasoning, which requires advanced understanding of cause-and-effect relationships across video segments, poses significant challenges even to frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotations. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still trail human performance by a significant 14.3% in accuracy. Additionally, our pipeline creates a training dataset of 9,695 machine-generated samples without manual effort, which empirical studies suggest can enhance across-time reasoning via fine-tuning.

Paper Structure (68 sections, 6 figures, 7 tables)

This paper contains 68 sections, 6 figures, 7 tables.

Figures (6)

  • Figure 1: A ReXTime example. Our benchmark specializes in evaluating reasoning across time, i.e., video QA in which the question and the answer belong to different time spans. ReXTime poses difficulties even for frontier MLLMs, as indicated by the large gap to human-level accuracy.
  • Figure 2: Overview of the data collection pipeline. In stage I, we collect event pairs from two video sources. In stage II, we score and categorize the event pairs into four relation types. In stage III, the (M)LLM generates a question-answer pair, guided by our carefully written few-shot demonstrations. In stage IV, the LLM self-evaluates the generated samples to reduce the human verification cost.
  • Figure 3: Reasoning-across-time question-answer types. The figure presents the relationships among the three categories of questions we generated, with examples. "Having dinner / Watching TV" lacks strong causality and is classified as sequential, which often results in before / after questions. "Girl falls down" shows strong causality with "The girl is crying" but lacks human intention, so it is classified as cause-effect. "Chopping tomato / Making a dish" not only has a strong causal relation but also shows subjective deliberation, so it is classified as means-to-an-end.
  • Figure 4: Data distribution. We visualize the distribution of our collected question-answer pairs. The pie chart shows the overall percentage of each relation category. The middle histogram shows the distribution of the number of words in a question. The right histogram shows the video duration distribution. The lower number of Cause-Effect samples in ActivityNet can be attributed to the nature of the dataset, which predominantly features human activities. These activities typically involve deliberate actions with specific intentions, leading to a higher percentage of Means-to-an-End instances.
  • Figure 5: We show the GUIs for different annotation / verification processes.
  • ...and 1 more figure
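
The Figure 3 caption describes the three question categories in terms of two properties of an event pair: whether the events are strongly causally related, and whether the first event reflects human intention. As a rough illustration, that description can be written as a small decision rule. The boolean inputs and the names `Relation` and `categorize` are assumptions made for this sketch; the pipeline in the paper scores event pairs with an LLM (stage II in Figure 2), and its actual criteria are not reproduced here.

```python
from enum import Enum


class Relation(Enum):
    """The three categories named in the Figure 3 caption."""
    SEQUENTIAL = "sequential"
    CAUSE_EFFECT = "cause-effect"
    MEANS_TO_AN_END = "means-to-an-end"


def categorize(strong_causality: bool, human_intention: bool) -> Relation:
    """Map the two (assumed boolean) properties of an event pair to a category."""
    if not strong_causality:
        # No strong causal link: only the temporal order is meaningful.
        return Relation.SEQUENTIAL
    if human_intention:
        # Causal and deliberate: the first event is done in order to reach the second.
        return Relation.MEANS_TO_AN_END
    # Causal but not deliberate.
    return Relation.CAUSE_EFFECT


if __name__ == "__main__":
    # The three example pairs from the caption, with hand-assigned properties.
    examples = {
        "Having dinner / Watching TV": (False, False),
        "Girl falls down / The girl is crying": (True, False),
        "Chopping tomato / Making a dish": (True, True),
    }
    for pair, (causal, intentional) in examples.items():
        print(f"{pair}: {categorize(causal, intentional).value}")
```

In this sketch an event pair without strong causality is always treated as sequential, regardless of intention, which matches the "Having dinner / Watching TV" example in the caption.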