YTCommentQA: Video Question Answerability in Instructional Videos
Saelyne Yang, Sunghyun Park, Yunseok Jang, Moontae Lee
TL;DR
We introduce YTCommentQA, a dataset of 2,332 real-world questions from 2,004 YouTube instructional videos, each labeled with an answerability category indicating whether the answer can be derived from visual cues, the script, or both. We define two tasks, Segment Answerability Classification and Video Answerability Classification, which assess whether an evidence segment or an entire video supports answering a question and which modality is required. Our analysis shows that most questions are answerable from visual cues or the script, with substantial overlap between the two, yet a nontrivial portion is unanswerable and some questions require both modalities together. Experiments with fine-tuned language models, zero-shot LLMs, and a multimodal model show that predicting answerability and modality is difficult and expose biases in model predictions, underscoring the need for integrated video-language understanding in video question answering.
Abstract
Instructional videos provide detailed how-to guides for various tasks, and viewers often pose questions about the content. Addressing these questions is vital for comprehending the content, yet receiving immediate answers is difficult. While numerous computational models have been developed for Video Question Answering (Video QA) tasks, they are primarily trained on questions generated from the video content and aim to produce answers from within that content. In real-world situations, however, users may pose questions that go beyond the video's informational boundaries, which makes it necessary to determine whether a video can provide the answer at all. Discerning whether a question can be answered by video content is challenging because videos are multi-modal: visual and verbal information are intertwined. To bridge this gap, we present the YTCommentQA dataset, which contains naturally generated questions from YouTube, categorized by their answerability and by the modality required to answer them: visual, script, or both. Experiments with answerability classification tasks demonstrate the complexity of YTCommentQA and emphasize the need to comprehend the combined role of visual and script information in video reasoning. The dataset is available at https://github.com/lgresearch/YTCommentQA.
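
To make the labeling scheme described above concrete, the following is a minimal sketch of how one question's answerability label might be represented. It is an illustration under stated assumptions, not the dataset's actual schema: the class `AnswerabilityLabel`, its fields `visual`, `script`, and `combined`, and the helper `summarize` are all hypothetical names. The sketch assumes a question can be answerable from the visuals alone, from the script alone, from either (the overlap case), or only from both together, and is otherwise unanswerable.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerabilityLabel:
    """Hypothetical label record; field names are illustrative only."""

    visual: bool = False    # assumed: answer can be found in the visual content alone
    script: bool = False    # assumed: answer can be found in the script alone
    combined: bool = False  # assumed: answer needs visual and script evidence together

    def is_answerable(self) -> bool:
        # A question counts as answerable if any modality setting supports it.
        return self.visual or self.script or self.combined

    def supporting_modalities(self) -> list[str]:
        # List every modality setting that supports an answer, in a fixed order.
        names = []
        if self.visual:
            names.append("visual")
        if self.script:
            names.append("script")
        if self.combined:
            names.append("combined")
        return names


def summarize(label: AnswerabilityLabel) -> str:
    """Return a short human-readable description of a label."""
    if not label.is_answerable():
        return "unanswerable"
    return "answerable via " + ", ".join(label.supporting_modalities())


if __name__ == "__main__":
    print(summarize(AnswerabilityLabel()))
    print(summarize(AnswerabilityLabel(visual=True, script=True)))
    print(summarize(AnswerabilityLabel(combined=True)))
```

Keeping `visual` and `script` as independent flags, rather than a single mutually exclusive category, is one way to represent questions that either modality can answer on its own.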
