ECIS-VQG: Generation of Entity-centric Information-seeking Questions from Videos

Arpan Phukan, Manish Gupta, Asif Ekbal

TL;DR

This work addresses the generation of entity-centric information-seeking questions from videos. It proposes a model architecture that combines Transformers, rich context signals (titles, transcripts, captions, embeddings), and a joint cross-entropy and contrastive loss to encourage entity-centric question generation.

Abstract

Previous studies on question generation from videos have mostly focused on generating questions about common objects and attributes and hence are not entity-centric. In this work, we focus on the generation of entity-centric information-seeking questions from videos. Such a system could be useful for video-based learning, recommending "People Also Ask" questions, video-based chatbots, and fact-checking. Our work addresses three key challenges: identifying question-worthy information, linking it to entities, and effectively utilizing multimodal signals. Further, to the best of our knowledge, there does not exist a large-scale dataset for this task. Most video question generation datasets cover TV shows, movies, or human activities, or lack entity-centric information-seeking questions. Hence, we contribute a diverse dataset of YouTube videos, VideoQuestions, consisting of 411 videos with 2265 manually annotated questions. We further propose a model architecture combining Transformers, rich context signals (titles, transcripts, captions, embeddings), and a joint cross-entropy and contrastive loss to encourage entity-centric question generation. Our best method yields BLEU, ROUGE, CIDEr, and METEOR scores of 71.3, 78.6, 7.31, and 81.9, respectively, demonstrating practical usability. We make the code and dataset publicly available at https://github.com/thePhukan/ECIS-VQG
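
The abstract does not spell out how the cross-entropy and contrastive terms are combined. The following is a minimal sketch of one common way to do it, assuming a weighted sum and an InfoNCE-style contrastive term over cosine similarities. The function names, the `weight` and `temperature` parameters, and the toy embeddings are all hypothetical illustrations and are not taken from the paper or its repository.

```python
import math

def cross_entropy(token_probs):
    """Mean negative log-likelihood of the reference question tokens.

    token_probs: probabilities the model assigned to each reference token.
    """
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def contrastive(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (assumed form): low when the anchor is closer
    to the positive embedding than to any negative embedding."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))

def combined_loss(token_probs, anchor, positive, negatives, weight=0.5):
    """Weighted sum of the two terms; the weighting scheme is an assumption."""
    return cross_entropy(token_probs) + weight * contrastive(
        anchor, positive, negatives
    )

# Toy usage: the anchor could stand for a generated-question embedding,
# the positive for an entity-centric reference, and the negatives for
# generic (non-entity-centric) questions. These roles are illustrative.
loss = combined_loss(
    token_probs=[0.9, 0.8, 0.95],
    anchor=[1.0, 0.0],
    positive=[0.9, 0.1],
    negatives=[[0.0, 1.0], [-1.0, 0.2]],
)
print(f"combined loss: {loss:.4f}")
```

In this sketch the contrastive term only shapes the embedding space, while the cross-entropy term still drives token-level fluency; the balance between the two would need to be tuned on held-out data.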

Paper Structure

This paper contains 32 sections, 5 figures, and 20 tables.

Figures (5)

  • Figure 1: Bing's People Also Ask (PAA) module (accessed Sep 21, 2024) displays a question (the second one) along with a relevant video thumbnail. When a user clicks on the thumbnail, they land on the most relevant chapter within the video. PAA is an apt application for Entity-centric Information-seeking Video QG systems.
  • Figure 2: Two examples of the ECIS QG task. In example 1, although the existing QG model t5_neg_qs generates a grammatically sound question, the question lacks key context such as a place (Where is the food cheap?) or a subject (Which food item?). In example 2, without the particular chair's name, the question generated by the existing QG model is too broad.
  • Figure 3: Architecture of the proposed method, showing components such as the input representations, the chapter titles classifier, and the Transformer encoder-decoder model. Inputs are shown in orange, outputs in green, models in blue, and loss functions in pink. Note that loss computation happens at train time only. The prompt is used for Alpaca only. The cross-attention Transformer layer and the video embedding are not used for Alpaca.
  • Figure 4: Length distribution (in words) of chapter titles, frame captions, video titles, and transcripts, and duration (in seconds), for NSC (NSCQ+NSCP) chapters in the VideoQuestions dataset.
  • Figure 5: ClipCap Caption: travelling through a wormhole in deep space. BLIP Caption: a bright blue nebula with a bright star in the middle