Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos
Mingfei Han, Linjie Yang, Xiaojun Chang, Lina Yao, Heng Wang
TL;DR
Shot2Story introduces a large-scale multi-shot video benchmark with shot-level visual and narration captions, long video summaries, and a dedicated QA task that probes temporal, holistic, and audio-based understanding. It combines frame-level captions, GPT-4-generated summaries, and thorough human verification, enabling robust evaluation of captioning, summarization, and QA models across modalities. The results show that ASR is crucial for joint understanding, that shot-structured processing outperforms holistic approaches, and that summaries generalize to other datasets for zero-shot QA, which highlights their practical value for multi-modal video understanding. The dataset and code pave the way for future work on video grounding and video conversation built on rich, structured textual representations.
Abstract
A short video clip may contain a progression of multiple events and an interesting storyline. To understand the story behind it, a human needs to capture the event in every shot and associate the events with one another. In this work, we present Shot2Story, a new multi-shot video understanding benchmark with detailed shot-level captions, comprehensive video summaries, and question-answering pairs. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video captioning, multi-shot video summarization, and multi-shot video question answering. Preliminary experiments reveal challenges in generating long and comprehensive summaries for multi-shot videos. Nevertheless, the generated summaries, though imperfect, already achieve competitive performance on existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.
