Shot2Story: A New Benchmark for Comprehensive Understanding of Multi-shot Videos

Mingfei Han, Linjie Yang, Xiaojun Chang, Lina Yao, Heng Wang

TL;DR

Shot2Story introduces a large-scale multi-shot video benchmark with shot-level visual and narration captions, long video summaries, and a dedicated QA task to probe temporal, holistic, and audio-based understanding. It combines frame-level captions, GPT-4–generated summaries, and thorough human verification, enabling robust evaluation of captioning, summarization, and QA models across modalities. The results show ASR is crucial for joint understanding, shot-structured processing outperforms holistic approaches, and summaries can generalize to other datasets for zero-shot QA, highlighting the practical impact for advanced multi-modal video understanding. The dataset and code pave the way for future work in video grounding and conversation grounded in rich, structured textual representations.

Abstract

A short video clip may contain the progression of multiple events and an interesting story line. A human needs to capture the event in every shot and associate these events with one another to understand the story behind them. In this work, we present Shot2Story, a new multi-shot video understanding benchmark with detailed shot-level captions, comprehensive video summaries, and question-answering pairs. To facilitate better semantic understanding of videos, we provide captions for both visual signals and human narrations. We design several distinct tasks, including single-shot video captioning, multi-shot video summarization, and multi-shot video question answering. Preliminary experiments show that generating a long and comprehensive summary for multi-shot videos remains challenging. Nevertheless, the generated imperfect summaries can already achieve competitive performance on existing video understanding tasks such as video question answering, promoting an under-explored setting of video understanding with detailed summaries.

Paper Structure (31 sections, 21 figures, 5 tables)

Figures (21)

  • Figure 1: An annotated example from our Shot2Story with single-shot visual captions and narration captions. Moreover, we provide coherent and reasonable video summaries, and question-answering pairs to facilitate comprehensive understanding of multi-shot videos.
  • Figure 2: Statistics of Shot2Story. Our dataset features detailed visual captions, narration captions, and video summaries, highlighting video progressions, transitions, camera cuts, and narration descriptions, with statistics of frequent expressions depicted in the figure.
  • Figure 3: Distribution of questions in the multi-shot video QA benchmark. Questions from different categories overlap. All-shared means the question falls under all three categories.
  • Figure 4: Structure of the multi-shot video summarization model SUM-shot. We arrange visual tokens sequentially for each single shot and in a multi-shot format to encapsulate multi-shot information. Additionally, ASR text is incorporated for audio-visual video summarization.
  • Figure 5: Example predictions of our models. (a) shows single-shot video captioning results of VideoChat2-C, which incorporates audio and visual content correctly; (b) shows multi-shot video summarization results of VideoChat2-SUM-shot, with accurate descriptions in green and errors in red, illustrating the model's ability to narrate event sequences; (c) shows two sample questions for the video in (b). The answers are based on the summary generated by VideoChat2-SUM-shot.
  • ...and 16 more figures