TimeLogic: A Temporal Logic Benchmark for Video QA
Sirnam Swetha, Hilde Kuehne, Mubarak Shah
TL;DR
TimeLogic QA (TLQA) introduces a scalable framework and benchmark to evaluate temporal logical understanding in VideoQA. It automatically generates QA pairs across 16 temporal-logic categories from four datasets (STAR, Breakfast, AGQA, CrossTask), producing TLQA-S and TLQA-L variants with 32k and 160k QA pairs per dataset. The framework builds per-time-step instance states from annotated videos, uses template questions aligned to temporal operators, and automatically creates positive and negative QA samples for robust evaluation. Zero-shot experiments with leading VideoQA models reveal that boolean questions are particularly challenging and that higher temporal capacity improves performance, underscoring the need for improved temporal reasoning in video-language models.
Abstract
Temporal logical understanding, a core facet of human cognition, plays a pivotal role in capturing complex sequential events and their temporal relationships within videos. This capability is particularly crucial in tasks such as Video Question Answering (VideoQA), where the goal is to process visual data over time together with textual data to provide coherent answers. However, current VideoQA benchmarks devote little attention to evaluating this skill because temporal logic is difficult to annotate. Despite the advancement of vision-language models, assessing their temporal logical reasoning remains a challenge, primarily due to the lack of QA pairs that demand formal, complex temporal reasoning. To bridge this gap, we introduce the TimeLogic QA (TLQA) framework, which automatically generates QA pairs specifically designed to evaluate temporal logical understanding. To this end, TLQA leverages temporal annotations from existing video datasets together with temporal operators derived from logic theory to construct questions that test understanding of event sequences and their temporal relationships. The TLQA framework is generic and scalable: it can use either existing video action datasets with temporal action segmentation annotations or video datasets with temporal scene graph annotations to automatically generate temporal logical questions. We leverage four datasets (STAR, Breakfast, AGQA, and CrossTask) and generate two VideoQA dataset variants, small (TLQA-S) and large (TLQA-L), containing 2k and 10k QA pairs for each category, resulting in 32k and 160k total pairs per dataset. We comprehensively evaluate leading VideoQA models, using TLQA to benchmark their temporal logical understanding on 16 categories of temporal logic with varying temporal complexity.
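To make the generation idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes temporal action segmentation annotations given as (label, start, end) segments over discrete time steps. The names `Segment`, `build_states`, `eventually`, `always`, `before`, and `make_boolean_qa`, the question templates, and the exact operator semantics are all hypothetical and chosen only for illustration. The 16 TLQA categories are not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """Hypothetical annotation record: an action label over [start, end)."""
    label: str
    start: int  # first time step (inclusive)
    end: int    # last time step (exclusive)


def build_states(segments, num_steps):
    """Per-time-step state: the set of labels active at each step."""
    states = [set() for _ in range(num_steps)]
    for seg in segments:
        for t in range(seg.start, min(seg.end, num_steps)):
            states[t].add(seg.label)
    return states


def eventually(states, a):
    """True if `a` is active at some time step."""
    return any(a in s for s in states)


def always(states, a):
    """True if `a` is active at every time step."""
    return all(a in s for s in states)


def before(states, a, b):
    """Assumed semantics: both occur, and `a` first appears before `b` does."""
    first_a = next((t for t, s in enumerate(states) if a in s), None)
    first_b = next((t for t, s in enumerate(states) if b in s), None)
    return first_a is not None and first_b is not None and first_a < first_b


# Illustrative templates, each paired with the operator that decides the answer.
OPERATORS = {
    "eventually": (eventually, "Does '{a}' ever happen in the video?"),
    "always": (always, "Is '{a}' happening throughout the video?"),
    "before": (before, "Does '{a}' start before '{b}' starts?"),
}


def make_boolean_qa(states, op, **slots):
    """Fill a template and derive the yes/no answer from the states."""
    fn, template = OPERATORS[op]
    answer = "yes" if fn(states, **slots) else "no"
    return template.format(**slots), answer


# Toy annotation: two consecutive actions over six time steps.
segments = [Segment("pour milk", 0, 3), Segment("stir", 3, 6)]
states = build_states(segments, num_steps=6)

# A positive sample, and a negative sample obtained by swapping the arguments.
positive = make_boolean_qa(states, "before", a="pour milk", b="stir")
negative = make_boolean_qa(states, "before", a="stir", b="pour milk")
print(positive)
print(negative)
```

Because each answer is computed from the annotated states rather than written by hand, positive and negative samples can be produced in bulk for every template. This is the property that would let a framework of this kind scale to 2k or 10k pairs per category.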
