Open-Ended and Knowledge-Intensive Video Question Answering

Md Zarif Ul Alam, Hamed Zamani

TL;DR

This work addresses knowledge-intensive video question answering by introducing a multi-modal retrieval-augmented generation framework that leverages textual and visual external knowledge to answer both open-ended and multiple-choice questions. By systematically evaluating multiple knowledge sources, retrieval models, and query formulations, the study demonstrates substantial gains over baselines that rely solely on video content, notably a 17.5% accuracy improvement on the multiple-choice questions of KnowIT VQA. Key contributions include a detailed analysis of how subtitles, video captions, and direct video retrieval influence end-to-end KI-VideoQA, the demonstration that combining subtitles and captions yields the best results in fine-tuned settings, and insights into the importance of query design and retrieval depth. The findings advance practical KI-VideoQA by outlining guidance for source selection, retrieval strategies, and prompt design, while also highlighting limitations in current video retrieval and opportunities for future work on longer-form content and explainability.

Abstract

Video question answering that requires external knowledge beyond the visual content remains a significant challenge in AI systems. While models can effectively answer questions based on direct visual observations, they often falter when faced with questions requiring broader contextual knowledge. To address this limitation, we investigate knowledge-intensive video question answering (KI-VideoQA) through the lens of multi-modal retrieval-augmented generation, with a particular focus on handling open-ended questions rather than just multiple-choice formats. Our comprehensive analysis examines various retrieval augmentation approaches using cutting-edge retrieval and vision-language models, testing both zero-shot and fine-tuned configurations. We investigate several critical dimensions: the interplay between different information sources and modalities, strategies for integrating diverse multi-modal contexts, and the dynamics between query formulation and retrieval result utilization. Our findings reveal that while retrieval augmentation shows promise in improving model performance, its success is heavily dependent on the chosen modality and retrieval methodology. The study also highlights the critical role of query construction and retrieval depth optimization in effective knowledge integration. Through our proposed approach, we achieve a substantial 17.5% improvement in accuracy on multiple-choice questions in the KnowIT VQA dataset, establishing new state-of-the-art performance levels.

Paper Structure

This paper contains 25 sections, 1 figure, and 6 tables.

Figures (1)

  • Figure 1: An overview of our multi-modal retrieval augmentation pipeline for KI-VideoQA tasks.