Grounded Question-Answering in Long Egocentric Videos

Shangzhe Di; Weidi Xie

TL;DR

This work tackles grounded question answering in long egocentric videos by proposing GroundVQA, a unified model that simultaneously localizes the relevant temporal window and generates an answer. GroundVQA fuses video and language through a visual-language encoder and uses two heads, one for temporal localization and one for answer generation, trained jointly on three tasks: video-language grounding (VLG), OpenQA, and CloseQA. To overcome data scarcity, the authors generate large-scale QA data from Ego4D narrations using large language models, yielding EgoTimeQA with 5,389 videos and 303K QA samples, which improves grounding and QA performance while reducing overfitting. They also introduce CloseQA, a close-ended task in which the model picks an answer from candidate choices, so that evaluation does not depend on scoring ambiguous open-ended answers. The approach achieves state-of-the-art results on the QaEgo4D and Ego4D-NLQ benchmarks, highlighting its potential for applications in episodic memory and robotics. Here $\mathcal{T}=(s,e)$ denotes the grounded temporal window, and the model optimizes the joint loss $L = 0.5 L_{\text{VLG}} + 0.5 L_{\text{QA}}$.
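The joint objective stated above can be illustrated with a small sketch. This is not the authors' implementation: the names `TemporalWindow` and `joint_loss` are hypothetical, the loss terms are treated as already-computed scalars, the time unit is assumed to be seconds, and exposing the weight as a parameter is an assumption (the summary only states the 0.5/0.5 setting).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TemporalWindow:
    """Grounded window T = (s, e). Seconds are an assumed unit."""
    start: float
    end: float

    def __post_init__(self):
        # A window whose end precedes its start is not meaningful.
        if self.end < self.start:
            raise ValueError("window end must not precede its start")

    @property
    def duration(self) -> float:
        return self.end - self.start


def joint_loss(loss_vlg: float, loss_qa: float, weight_vlg: float = 0.5) -> float:
    """Weighted sum w * L_VLG + (1 - w) * L_QA.

    The default w = 0.5 reproduces L = 0.5 L_VLG + 0.5 L_QA; any other
    weight is a hypothetical generalization, not something the source states.
    """
    if not 0.0 <= weight_vlg <= 1.0:
        raise ValueError("weight_vlg must lie in [0, 1]")
    return weight_vlg * loss_vlg + (1.0 - weight_vlg) * loss_qa


if __name__ == "__main__":
    window = TemporalWindow(start=3.0, end=7.5)
    print(window.duration)
    print(joint_loss(loss_vlg=2.0, loss_qa=4.0))
```

With equal weights, neither the grounding term nor the answering term dominates the gradient by construction, which matches the idea of training both heads jointly in one model.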

Abstract

Existing approaches to video understanding, mainly designed for short videos from a third-person perspective, are limited in their applicability in certain fields, such as robotics. In this paper, we delve into open-ended question-answering (QA) in long, egocentric videos, which allows individuals or robots to inquire about their own past visual experiences. This task presents unique challenges, including the complexity of temporally grounding queries within extensive video content, the high resource demands for precise data annotation, and the inherent difficulty of evaluating open-ended answers due to their ambiguous nature. Our proposed approach tackles these challenges by (i) integrating query grounding and answering within a unified model to reduce error propagation; (ii) employing large language models for efficient and scalable data synthesis; and (iii) introducing a close-ended QA task for evaluation, to manage answer ambiguity. Extensive experiments demonstrate the effectiveness of our method, which also achieves state-of-the-art performance on the QaEgo4D and Ego4D-NLQ benchmarks. Code, data, and models are available at https://github.com/Becomebright/GroundVQA.

Paper Structure (24 sections, 2 equations, 10 figures, 7 tables)

Introduction
Related Work
Method
Task Definition
A Multi-tasking Architecture
Generate QA from Narrations
Multi-task Training
Experiments
Dataset and Metrics
Implementation Details
QA Baselines
Ablations
Comparison with State-of-the-art
Conclusion
Llama2 vs. ChatGPT on Data Generation
...and 9 more sections

Figures (10)

Figure 1: We propose a unified model for addressing grounded question answering in long egocentric videos, i.e., simultaneously identifying the temporal window to a question, generating answers in natural language (OpenQA task), or picking answers from candidate choices (CloseQA task).
Figure 2: Overview of GroundVQA. It addresses three tasks: OpenQA, CloseQA, and VLG. The model processes a video $\mathcal{V}$ and a question $\mathcal{Q}$, to reason about the relevant temporal window $\mathcal{T}$ and the answer $\mathcal{A}$. Initially, a frozen video backbone encodes $\mathcal{V}$ and maps it into the language embedding space. Simultaneously, $\mathcal{Q}$ undergoes tokenization and is transformed through an embedding layer. These video and question embeddings are then fused using a visual-language encoder. Finally, a temporal localizer uses the resulting video features to predict $\mathcal{T}$, whereas a language decoder utilizes both video and question features, as provided by the VL encoder, to generate $\mathcal{A}$.
Figure 3: The prompts for generating OpenQA and CloseQA training data with Llama2. (A) First, we generate question-answer pairs using consecutive narration sentences from Ego4D. (B) Next, we generate three plausible yet incorrect answers for each question-answer pair to construct data for the CloseQA task. We provide in-context examples to enhance the generation quality.
Figure 4: Training and validation curves of GroundVQA$_\texttt{S}$. The limited training data of QaEgo4D results in severe overfitting, which is effectively mitigated by our generated EgoTimeQA.
Figure 5: Qualitative examples. In our demonstration, we compare three models: the Oracle baseline, our GroundVQA, and SimpleVQA$^*$. Each column presents a sample that includes the query $\mathcal{Q}$, the ground truth answer $\mathcal{A}$, three frames from the grounded video segment, and the predicted answer $\hat{\mathcal{A}}$. Additionally, each column illustrates the video's time span and the predicted temporal window $\mathcal{T}$, with Oracle's temporal window serving as the ground truth. Note that SimpleVQA$^*$ is incapable of predicting the temporal window.
...and 5 more figures
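
Step (B) of Figure 3, which pairs each question-answer pair with three plausible yet incorrect answers, might be assembled into a CloseQA sample roughly as follows. This is a minimal sketch under stated assumptions: `CloseQASample` and `build_closeqa_sample` are hypothetical names, shuffling the four candidates is an assumed design choice, and the example question and answers are invented for illustration rather than taken from EgoTimeQA.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class CloseQASample:
    question: str
    choices: List[str]   # one correct answer plus three distractors
    answer_index: int    # position of the correct answer in `choices`


def build_closeqa_sample(question: str, correct: str, wrong: List[str],
                         rng: random.Random) -> CloseQASample:
    """Combine a correct answer with three wrong ones and shuffle them."""
    if len(wrong) != 3:
        raise ValueError("expected exactly three incorrect answers")
    if correct in wrong:
        raise ValueError("the correct answer must not appear among the wrong ones")
    choices = [correct, *wrong]
    # Shuffle so the correct answer's position carries no information.
    rng.shuffle(choices)
    return CloseQASample(question, choices, choices.index(correct))


if __name__ == "__main__":
    # Invented example, not real dataset content.
    sample = build_closeqa_sample(
        "What did I put in the drawer?",
        "scissors",
        ["knife", "spoon", "tape"],
        random.Random(0),
    )
    print(sample.choices, sample.answer_index)
```

Passing a seeded `random.Random` keeps the candidate order reproducible across runs, which is convenient when regenerating a fixed evaluation set.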
