FriendsQA: A New Large-Scale Deep Video Understanding Dataset with Fine-grained Topic Categorization for Story Videos
Zhengqian Wu, Ruizhe Li, Zijun Xu, Zhongyuan Wang, Chunxia Xiao, Chao Liang
TL;DR
The paper addresses the gap in evaluating deep video understanding (DVU) for story videos by introducing FriendsQA, a large-scale DVU dataset derived from the sitcom Friends. It presents StoryMind, a multi-agent LLM framework that automatically generates and filters questions across 14 fine-grained topics and includes a cross-episode question mechanism and a two-factor difficulty measure. FriendsQA comprises 44.6K questions over 234 episodes (average length 1,358s), enabling balanced topic coverage and DVU-focused evaluation. Ten state-of-the-art VideoQA models, spanning VLMs and MLLMs, are benchmarked to reveal that MLLMs generally outperform VLMs and to analyze topic-specific and difficulty-driven performance, underscoring DVU challenges and the value of ground-truth storyline cues. The work provides a scalable methodology for automatic DVU data creation and a comprehensive framework for assessing long-range storyline understanding in video QA.
Abstract
Video question answering (VideoQA) aims to answer natural language questions according to the given videos. Although existing models perform well in the factoid VideoQA task, they still face challenges in the deep video understanding (DVU) task, which focuses on story videos. Compared to factoid videos, the most significant feature of story videos is their storylines, which are composed of complex interactions and the long-range evolution of core story topics, including characters, actions, and locations. Understanding these topics requires models to possess DVU capability. However, existing DVU datasets rarely organize questions according to these story topics, making it difficult to comprehensively assess VideoQA models' DVU capability on complex storylines. Additionally, the question quantity and video length of these datasets are limited by the high labor costs of handcrafted dataset construction. In this paper, we devise a large language model-based multi-agent collaboration framework, StoryMind, to automatically generate a new large-scale DVU dataset. The dataset, FriendsQA, derived from the renowned sitcom Friends with an average episode length of 1,358 seconds, contains 44.6K questions evenly distributed across 14 fine-grained topics. Finally, we conduct comprehensive experiments on 10 state-of-the-art VideoQA models using the FriendsQA dataset.
