ReXTime: A Benchmark Suite for Reasoning-Across-Time in Videos
Jr-Jen Chen, Yu-Chien Liao, Hsi-Che Lin, Yu-Chu Yu, Yen-Chun Chen, Yu-Chiang Frank Wang
TL;DR
ReXTime is a scalable benchmark for evaluating reasoning across time in videos, targeting cases where a question and its answer fall in different video segments. It introduces an automated LLM-assisted pipeline that yields 9,695 training samples, in addition to 921 validation and 2,143 test samples, alongside a QA-IoU metric to quantify cross-time reasoning and grounding. Benchmark results show that frontier multimodal LLMs lag behind human performance on both temporal-reasoning VQA and moment localization: GPT-4o achieves about 73.7% VQA accuracy versus around 88% for humans, indicating substantial room for improvement. The automatic data-generation approach reduces labeling cost by about 55% and enables effective fine-tuning, making ReXTime a practical driver for advancing cross-time temporal reasoning in video-language models.
Abstract
We introduce ReXTime, a benchmark designed to rigorously test AI models' ability to perform temporal reasoning within video events. Specifically, ReXTime focuses on reasoning across time, i.e., human-like understanding when the question and its corresponding answer occur in different video segments. This form of reasoning, which requires advanced understanding of cause-and-effect relationships across video segments, poses significant challenges even to frontier multimodal large language models. To facilitate this evaluation, we develop an automated pipeline for generating temporal reasoning question-answer pairs, significantly reducing the need for labor-intensive manual annotation. Our benchmark includes 921 carefully vetted validation samples and 2,143 test samples, each manually curated for accuracy and relevance. Evaluation results show that while frontier large language models outperform academic models, they still lag behind human performance by a significant 14.3% accuracy gap. Additionally, our pipeline creates a training dataset of 9,695 machine-generated samples without manual effort, which empirical studies suggest can enhance across-time reasoning via fine-tuning.
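The TL;DR mentions a QA-IoU metric, and the abstract describes questions and answers that occur in different video segments. The exact definition of QA-IoU is not given in this text, so the following is only a minimal sketch of the standard temporal intersection-over-union between two time spans, under the assumption that each span is a (start, end) pair in seconds. The function name temporal_iou and the example spans are hypothetical and are not taken from the benchmark.

```python
def temporal_iou(span_a, span_b):
    """Intersection-over-union of two time intervals given as (start, end) in seconds.

    Assumption: this is the generic temporal IoU, not necessarily the exact
    QA-IoU formulation used by ReXTime.
    """
    a_start, a_end = span_a
    b_start, b_end = span_b
    if a_end < a_start or b_end < b_start:
        raise ValueError("each span must satisfy start <= end")

    # Overlap length is zero when the spans are disjoint.
    intersection = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    # Union length = sum of both lengths minus the shared part.
    union = (a_end - a_start) + (b_end - b_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical spans: the question refers to one segment, the answer to a later one.
question_span = (2.0, 6.0)
answer_span = (10.0, 15.0)
print(temporal_iou(question_span, answer_span))  # disjoint spans give 0.0
```

Under this reading, a value near zero between the question span and the answer span would indicate that the two lie in different parts of the video, which is the across-time setting the benchmark targets. The same function can also score a predicted moment against a ground-truth moment for localization.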
