YTCommentQA: Video Question Answerability in Instructional Videos

Saelyne Yang, Sunghyun Park, Yunseok Jang, Moontae Lee

TL;DR

YTCommentQA is a dataset of 2,332 real-world questions from 2,004 YouTube instructional videos, each labeled with an answerability category indicating whether the answer can be derived from visual cues, the script, or both. The paper defines two tasks, Segment Answerability Classification and Video Answerability Classification, which assess whether an evidence segment or an entire video supports answering a question and which modality is required. Analysis shows that most questions are answerable via visual cues or scripts, with substantial overlap between the two; a nontrivial portion remains unanswerable, and some questions require both modalities. Experiments with fine-tuned language models, zero-shot LLMs, and a multimodal model show that predicting answerability and modality is difficult, reveal biases in model predictions, and underline the importance of integrated video-language understanding for reliable video question answering.

Abstract

Instructional videos provide detailed how-to guides for various tasks, with viewers often posing questions regarding the content. Addressing these questions is vital for comprehending the content, yet receiving immediate answers is difficult. While numerous computational models have been developed for Video Question Answering (Video QA) tasks, they are primarily trained on questions generated based on video content, aiming to produce answers from within the content. However, in real-world situations, users may pose questions that go beyond the video's informational boundaries, highlighting the necessity to determine if a video can provide the answer. Discerning whether a question can be answered by video content is challenging due to the multi-modal nature of videos, where visual and verbal information are intertwined. To bridge this gap, we present the YTCommentQA dataset, which contains naturally-generated questions from YouTube, categorized by their answerability and required modality to answer -- visual, script, or both. Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning. The dataset is available at https://github.com/lgresearch/YTCommentQA.
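
The answerability categories described above can be pictured as a small labeling routine. The following is only an illustrative sketch: the enum, its member names, and the `label_segment` function are hypothetical and are not taken from the released dataset or its schema. It assumes three annotator judgments per question (answerable from the visual snippet alone, from the script snippet alone, or only when both are combined) and allows the visual and script labels to overlap.

```python
from enum import Enum
from typing import Set


class Answerability(Enum):
    """Hypothetical label names; the released dataset may use different ones."""
    UNANSWERABLE = "unanswerable"
    VISUAL = "visual"
    SCRIPT = "script"
    VISUAL_AND_SCRIPT = "visual+script"  # answerable only with both together


def label_segment(visual_alone: bool,
                  script_alone: bool,
                  both_together: bool) -> Set[Answerability]:
    """Map annotator judgments to a set of answerability labels.

    A question may be answerable by the visual snippet alone and also by
    the script snippet alone, so the result is a set rather than one label.
    """
    labels: Set[Answerability] = set()
    if visual_alone:
        labels.add(Answerability.VISUAL)
    if script_alone:
        labels.add(Answerability.SCRIPT)
    if labels:
        return labels
    # Neither modality suffices on its own: check whether the combination
    # does, otherwise treat the question as unanswerable.
    if both_together:
        return {Answerability.VISUAL_AND_SCRIPT}
    return {Answerability.UNANSWERABLE}


if __name__ == "__main__":
    print(label_segment(True, True, False))
    print(label_segment(False, False, True))
    print(label_segment(False, False, False))
```

Returning a set rather than a single label is a design choice made here to reflect the overlap between visually answerable and script-answerable questions; a classifier for either task could instead collapse these into mutually exclusive classes.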

Paper Structure (43 sections, 6 figures, 6 tables)

This paper contains 43 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: A question on video can be either (1) unanswerable by video, (2) answerable by visual, (3) answerable by script, or (4) answerable when both visual and script are present. The figure shows an example of (4), where the question is answerable with the understanding of both visual and script.
  • Figure 2: Annotation workflow for the video question answerability. Once an annotator identifies that the timestamp used in a reply to a given question suggests an answer in the video, they are provided with visual and script snippets centered around the timestamp. For questions that could not be answered using visual or script snippets, the annotators are asked whether both were necessary to answer or if the question was unanswerable altogether.
  • Figure 3: Example question and its replies that contain a timestamp.
  • Figure 4: Distribution of the first two words for questions in YTCommentQA, which shows the diversity of the collected questions. The sequence of words begins from the center and extends outward. Words with small font sizes are omitted.
  • Figure A: We provide the video snippets centered around timestamp, aligned with the closest transcript and its corresponding visual content.
  • ...and 1 more figure