Question-Answering Dense Video Events

Hangyu Qin, Junbin Xiao, Angela Yao

TL;DR

This paper addresses answering questions about dense events in long videos and grounding those answers in the relevant video moments. It introduces DeVE-QA, a dataset of 78K questions about 26K events across 10.6K long videos, and presents DeVi, a training-free framework that combines hierarchical dense event captioning, a temporal event memory, and self-consistency checking to support faithful QA and grounding. Compared with prior methods, DeVi improves grounded QA accuracy by 4.8 percentage points on DeVE-QA and by 2.1 points on NExT-GQA, with QA accuracy around 71–72% and improved grounding quality. The results underline the importance of multi-scale event modeling, long-range temporal reasoning, and answer verification in dense-video QA, and the dataset and method give future work on zero-shot dense-event reasoning a benchmark and a baseline.

Abstract

This paper presents question-answering on dense video events, a novel task that answers and grounds dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events over extended periods of time. To facilitate the study, we construct DeVE-QA -- a dataset featuring 78K questions about 26K events on 10.6K long videos. Our benchmarking shows that state-of-the-art MLLMs struggle on DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach that highlights a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module to respectively detect, contextualize and memorize, and ground dense-events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a notable increase of 4.8% and 2.1% for G(round)QA accuracy on DeVE-QA and NExT-GQA, respectively. Data and code are available at https://github.com/QHUni/DeVE-QA.
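
The abstract reports grounded QA (GQA) accuracy but does not define it in this excerpt. As a rough illustration only, the sketch below assumes that a prediction counts as correct when the answer matches the annotated answer and the predicted time span overlaps the annotated span sufficiently. The overlap measure (intersection over prediction), the 0.5 threshold, and all function names are assumptions made for this example, not the paper's evaluation code.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_sec, end_sec)
# One sample: (predicted answer, annotated answer, predicted span, annotated span)
Sample = Tuple[str, str, Segment, Segment]


def intersection(a: Segment, b: Segment) -> float:
    """Length in seconds of the overlap between two time spans."""
    return max(0.0, min(a[1], b[1]) - max(a[0], b[0]))


def iou(pred: Segment, gt: Segment) -> float:
    """Temporal intersection over union."""
    inter = intersection(pred, gt)
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def iop(pred: Segment, gt: Segment) -> float:
    """Temporal intersection over the predicted span's length."""
    length = pred[1] - pred[0]
    return intersection(pred, gt) / length if length > 0 else 0.0


def grounded_qa_accuracy(samples: List[Sample], threshold: float = 0.5) -> float:
    """Fraction of samples with a correct answer AND IoP >= threshold.

    The threshold and the use of IoP are assumptions for illustration.
    """
    if not samples:
        return 0.0
    hits = sum(
        1
        for pred_ans, gt_ans, pred_seg, gt_seg in samples
        if pred_ans == gt_ans and iop(pred_seg, gt_seg) >= threshold
    )
    return hits / len(samples)
```

Under this definition, a correct answer paired with a span that misses the annotated moment does not count, which is why grounded QA accuracy is typically lower than plain QA accuracy.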

Paper Structure

This paper contains 21 sections, 1 equation, 8 figures, and 12 tables.

Figures (8)

  • Figure 1: Question answering dense video events on DeVE-QA (a) vs. single video event on MSRVTT-QA (b).
  • Figure 2: DeVE-QA construction pipeline.
  • Figure 3: QA examples in DeVE-QA.
  • Figure 4: DeVE-QA analysis. (a) Question distribution in DeVE-QA. (b) Certificate length of VideoQA datasets.
  • Figure 5: DeVi framework: (1) Hierarchical dense event video segmenting and captioning, (2) contextualizing and memorizing events in temporal event memory, and (3) event-grounded video question answering with self-consistency checking.
  • ...and 3 more figures