VideoChain: A Transformer-Based Framework for Multi-hop Video Question Generation

Arpan Phukan, Anupam Pandey, Deepjyoti Bodo, Asif Ekbal

TL;DR

This work addresses the gap in multimodal question generation by introducing VideoChain, a transformer-based framework for multi-hop video question generation (MVQG) that grounds questions across temporally separated video segments using VideoMAE embeddings and a modified BART backbone. It presents MVQ-60, the first large-scale MVQG dataset, created by automatically merging zero-hop TVQA+ pairs to force cross-segment reasoning across six TV shows. The architecture comprises a modular two-component design that first generates zero-hop questions per segment and then composes them into multi-hop questions, trained in two stages with cross-modal fusion and beam search for coherent generation. Empirical results show VideoChain achieving strong automatic metrics (ROUGE-L 0.6454, ROUGE-1 0.6854, BLEU-1 0.6711, BERTScore-F1 0.7967, semantic similarity 0.8110) and superior human judgments across fluency, relevance, and multi-hop reasoning, highlighting the value of explicit multimodal grounding and modular reasoning for video-based QA tasks.

Abstract

Multi-hop Question Generation (QG) effectively evaluates reasoning but remains confined to text; Video Question Generation (VideoQG) is limited to zero-hop questions over single segments. To address this, we introduce VideoChain, a novel Multi-hop Video Question Generation (MVQG) framework designed to generate questions that require reasoning across multiple, temporally separated video segments. VideoChain features a modular architecture built on a modified BART backbone enhanced with video embeddings, capturing textual and visual dependencies. Using the TVQA+ dataset, we automatically construct the large-scale MVQ-60 dataset by merging zero-hop QA pairs, ensuring scalability and diversity. Evaluations show VideoChain's strong performance across standard generation metrics: ROUGE-L (0.6454), ROUGE-1 (0.6854), BLEU-1 (0.6711), BERTScore-F1 (0.7967), and semantic similarity (0.8110). These results highlight the model's ability to generate coherent, contextually grounded, and reasoning-intensive questions.
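
As a rough illustration of what the reported ROUGE-L figure measures, the sketch below computes an LCS-based ROUGE-L F1 between a generated question and a reference question. The abstract does not say which ROUGE implementation or settings were used, so the whitespace tokenization, lowercasing, and balanced F1 here are assumptions, and the function names and example questions are hypothetical. Published toolkits may apply stemming or a different recall weighting and can give different values.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """Balanced ROUGE-L F1 over lowercased, whitespace-split tokens (assumed settings)."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    if not cand_tokens or not ref_tokens:
        return 0.0
    lcs = lcs_length(cand_tokens, ref_tokens)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand_tokens)
    recall = lcs / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical generated and reference questions, for illustration only.
generated = "who entered the room after sheldon left"
reference = "who entered the apartment after sheldon left"
score = rouge_l_f1(generated, reference)
print(round(score, 4))
```

Because ROUGE-L rewards in-order token overlap rather than exact n-gram matches, a generated question that swaps a single word still scores high, which is one reason the abstract reports it alongside BERTScore-F1 and semantic similarity.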

Paper Structure

This paper contains 28 sections, 4 equations, 2 figures, 10 tables, 1 algorithm.

Figures (2)

  • Figure 1: Example: Merged Multi-hop Question
  • Figure 2: Proposed Model Architecture