Grounded Multi-Hop VideoQA in Long-Form Egocentric Videos

Qirui Chen, Shangzhe Di, Weidi Xie

TL;DR

The paper tackles grounded multi-hop question answering over long-form egocentric videos by defining the MH-VidQA task and constructing the MultiHop-EgoQA benchmark via an automated, narration-based data-creation pipeline. It introduces GeLM, a grounding-enhanced MLLM that inserts temporal grounding tokens into its response and uses dual grounding branches to retrieve multiple pieces of temporal evidence, trained with a combination of QA, saliency, and similarity losses. Empirical results show that existing multi-modal models underperform on multi-hop grounding, while GeLM, when instruction-tuned on the automatically generated data, achieves substantial gains in multi-hop grounding. When trained on third-person data, it also reaches state-of-the-art performance on the single-hop VidQA benchmark ActivityNet-RTL. The work demonstrates the value of automated visual instruction data for advancing instruction-following VLMs and highlights the importance of explicit temporal grounding for robust video-language understanding in long-form egocentric videos.

Abstract

This paper considers the problem of Multi-Hop Video Question Answering (MH-VidQA) in long-form egocentric videos. The task requires not only answering visual questions, but also localizing multiple relevant time intervals within the video as visual evidence. We develop an automated pipeline to create multi-hop question-answering pairs with associated temporal evidence, enabling the construction of a large-scale dataset for instruction-tuning. To monitor progress on this new task, we further curate a high-quality benchmark, MultiHop-EgoQA, with careful manual verification and refinement. Experimental results reveal that existing multi-modal systems exhibit inadequate multi-hop grounding and reasoning abilities, resulting in unsatisfactory performance. We then propose a novel architecture, termed Grounding Scattered Evidence with Large Language Model (GeLM), that enhances multi-modal large language models (MLLMs) by incorporating a grounding module to retrieve temporal evidence from videos using flexible grounding tokens. Trained on our visual instruction data, GeLM demonstrates improved multi-hop grounding and reasoning capabilities, setting a new baseline for this challenging task. Furthermore, when trained on third-person view videos, the same architecture also achieves state-of-the-art performance on the single-hop VidQA benchmark, ActivityNet-RTL, demonstrating its effectiveness.
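
To make the task format concrete, the following is a minimal sketch of how a single MH-VidQA sample could be represented: a question, an answer, and several time intervals that serve as evidence. The class name, field names, the "at least two spans" check, and the example values are all hypothetical illustrations; they are not the schema or contents of MultiHop-EgoQA.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class MultiHopSample:
    """Hypothetical container for one grounded multi-hop QA sample."""
    question: str
    answer: str
    # Each evidence span is (start_seconds, end_seconds) within the video.
    evidence: List[Tuple[float, float]]

    def validate(self, video_duration: float) -> None:
        # Assumption: "multiple" time intervals means at least two spans.
        if len(self.evidence) < 2:
            raise ValueError("a multi-hop sample needs at least two evidence spans")
        for start, end in self.evidence:
            # Every span must be non-empty and lie inside the video.
            if not (0.0 <= start < end <= video_duration):
                raise ValueError(f"invalid span: ({start}, {end})")


# Illustrative values only; not taken from the benchmark.
sample = MultiHopSample(
    question="Where did I put the knife after cutting the onion?",
    answer="On the chopping board, then in the sink.",
    evidence=[(12.0, 18.5), (97.0, 103.0)],
)
sample.validate(video_duration=180.0)
```

Keeping the evidence as a list of spans, rather than a single interval, is what separates this setting from single-hop grounded VidQA.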

Paper Structure (48 sections, 11 equations, 8 figures, 18 tables)

Figures (8)

  • Figure 1: We introduce the problem of Multi-Hop Video Question Answering for long-form egocentric video understanding. This task requires the model to answer questions by gathering and reasoning across scattered visual clues, necessitating the grounding of multiple relevant time spans as supporting evidence.
  • Figure 2: Illustration of our data curation pipeline. To collect large-scale multi-hop VidQA data, we have developed an automated pipeline. We begin by using action scene graphs to identify potential multi-hop reasoning questions based on the syntax trees of annotated narrations. Next, we use GPT-4o to generate data samples that include questions, answers, and relevant time spans. Finally, we perform manual validation and refinement to create the new benchmark.
  • Figure 3: Overview of the proposed architecture. GeLM can generate grounding token pairs, i.e., <T> </T>, in the response of a multi-modal large language model, which denote the start and end times of the enclosed statement. These grounding tokens are then processed with visual hidden states to ground multiple time spans that provide evidence supporting the answer.
  • Figure 4: Visualization of MultiHop-EgoQA statistics.
  • Figure 5: User interfaces for the manual annotation and the evaluation of human performance on MultiHop-EgoQA.
  • ...and 3 more figures
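
The grounding token pairs described in the Figure 3 caption can be illustrated with a small sketch that extracts the statements enclosed by <T> and </T> from a response string. The function name and the example response are hypothetical. In GeLM the time spans themselves are produced by the grounding module from the tokens and visual hidden states, which this text-only sketch does not model.

```python
import re
from typing import List

# Matches the text enclosed by one <T> ... </T> grounding token pair.
_GROUNDED = re.compile(r"<T>(.*?)</T>", flags=re.DOTALL)


def grounded_statements(response: str) -> List[str]:
    """Return each statement wrapped in a <T> </T> pair, in order."""
    return [m.strip() for m in _GROUNDED.findall(response)]


# Hypothetical model response, for illustration only.
response = (
    "<T>You cut the onion on the board</T> and later "
    "<T>you placed the knife in the sink</T>."
)
statements = grounded_statements(response)
```

Each extracted statement corresponds to one piece of evidence, so a multi-hop answer would yield two or more entries.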