TimeLogic: A Temporal Logic Benchmark for Video QA

Sirnam Swetha, Hilde Kuehne, Mubarak Shah

TL;DR

TimeLogic QA (TLQA) introduces a scalable framework and benchmark to evaluate temporal logical understanding in VideoQA. It automatically generates QA pairs across 16 temporal-logic categories from four datasets (STAR, Breakfast, AGQA, CrossTask), producing TLQA-S and TLQA-L variants with 32k and 160k QA pairs per dataset. The framework builds per-time-step instance states from annotated videos, uses template questions aligned to temporal operators, and automatically creates positive and negative QA samples for robust evaluation. Zero-shot experiments with leading VideoQA models reveal that boolean questions are particularly challenging and that higher temporal capacity improves performance, underscoring the need for improved temporal reasoning in video-language models.

Abstract

Temporal logical understanding, a core facet of human cognition, plays a pivotal role in capturing complex sequential events and their temporal relationships within videos. This capability is particularly crucial in tasks like Video Question Answering (VideoQA), where the goal is to process visual data over time together with textual data to provide coherent answers. However, current VideoQA benchmarks pay little attention to evaluating this critical skill because temporal logic is difficult to annotate. Despite the advancement of vision-language models, assessing their temporal logical reasoning remains a challenge, primarily due to the lack of QA pairs that demand formal, complex temporal reasoning. To bridge this gap, we introduce the TimeLogic QA (TLQA) framework, which automatically generates QA pairs specifically designed to evaluate temporal logical understanding. To this end, TLQA leverages temporal annotations from existing video datasets together with temporal operators derived from logic theory to construct questions that test understanding of event sequences and their temporal relationships. The TLQA framework is generic and scalable: it can use either existing video action datasets with temporal action segmentation annotations or video datasets with temporal scene graph annotations to automatically generate temporal logical questions. We leverage four datasets (STAR, Breakfast, AGQA, and CrossTask) and generate two VideoQA dataset variants, small (TLQA-S) and large (TLQA-L), containing 2k and 10k QA pairs for each category, resulting in 32k and 160k total pairs per dataset. We undertake a comprehensive evaluation of leading-edge VideoQA models, employing TLQA to benchmark their temporal logical understanding capabilities. We assess the VideoQA models' temporal reasoning performance on 16 categories of temporal logic with varying temporal complexity.
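
The per-dataset totals quoted above follow from multiplying the per-category counts by the 16 temporal-logic categories. As a quick check (the symbols $N_S$ and $N_L$ are introduced here only for illustration and do not appear in the paper):

```latex
\begin{align*}
N_S &= 16 \times 2{,}000 = 32{,}000 \quad \text{(TLQA-S, per dataset)} \\
N_L &= 16 \times 10{,}000 = 160{,}000 \quad \text{(TLQA-L, per dataset)}
\end{align*}
```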

Paper Structure (15 sections, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Framework Overview. Given existing video datasets with either dense scene graph annotations or temporal action annotations, as illustrated, our framework automatically generates QA pairs for temporal logic with varying complexity. First, we build instance states representing the overall action at each time step ($t_i$) throughout the video. For each temporal category, we generate all positive questions ($P_Q$) that are valid for the video, i.e., that satisfy the temporal logic definition. Then we generate all possible questions ($A_Q$) by taking all possible actions for the video/dataset. The negative questions are then sampled from $A_Q \setminus P_Q$.
  • Figure 2: Temporal Intervals for two actions X, Y. $t_i$: time step.
  • Figure 3: Baseline comparison for multiple-choice TLQA. We provide blank frames to SeViLA as a baseline to evaluate the performance on multiple-choice TLQA benchmark. MC: Multiple-Choice.
  • Figure 4: Qualitative Results for Multiple-Choice QA. In some cases, scene or object information might correlate with the correct answer, resulting in an easier setup compared to binary QA.
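
The generation steps described in the Figure 1 caption (instance states per time step, positive questions $P_Q$, all questions $A_Q$, negatives sampled from their difference) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `(label, start, end)` annotation format, the single "before" category, the question template, and all function names are assumptions made here for the example.

```python
import random
from itertools import permutations

# Assumed annotation format (hypothetical): (action_label, start_step, end_step),
# with end_step exclusive. Real datasets will need their own loaders.

def build_instance_states(segments, num_steps):
    """Return, for each time step t_i, the set of actions active at that step."""
    states = [set() for _ in range(num_steps)]
    for label, start, end in segments:
        for t in range(max(start, 0), min(end, num_steps)):
            states[t].add(label)
    return states

def action_spans(states):
    """Recover the first and last active time step of each action."""
    spans = {}
    for t, active in enumerate(states):
        for label in active:
            first, _ = spans.get(label, (t, t))
            spans[label] = (first, t)
    return spans

def positive_before_pairs(states):
    """P_Q for an illustrative 'X before Y' category: X ends before Y starts."""
    spans = action_spans(states)
    return {(x, y) for x, y in permutations(spans, 2)
            if spans[x][1] < spans[y][0]}

def sample_negatives(positives, vocabulary, k, seed=0):
    """Sample up to k negatives from A_Q minus P_Q (A_Q: all ordered action pairs)."""
    all_pairs = set(permutations(sorted(vocabulary), 2))
    candidates = sorted(all_pairs - positives)
    rng = random.Random(seed)
    return rng.sample(candidates, min(k, len(candidates)))

# Toy video with three consecutive actions over seven time steps.
segments = [("take cup", 0, 2), ("pour milk", 2, 5), ("stir", 5, 7)]
states = build_instance_states(segments, num_steps=7)
positives = positive_before_pairs(states)
negatives = sample_negatives(positives, {s[0] for s in segments}, k=2)

# Hypothetical template for a boolean question.
template = "Does the person {x} before they {y}?"
qa_pairs = [(template.format(x=x, y=y), "yes") for x, y in sorted(positives)]
qa_pairs += [(template.format(x=x, y=y), "no") for x, y in negatives]
```

Here the action vocabulary is taken from the video itself; the caption also allows drawing $A_Q$ from the whole dataset, which would amount to passing a larger `vocabulary` to `sample_negatives`.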