Zero-Shot Long-Form Video Understanding through Screenplay

Yongliang Wu, Bozheng Li, Jiawang Cao, Wenbo Zhu, Yi Lu, Weiheng Chi, Chuyun Xie, Haolin Zheng, Ziyue Su, Jay Wu, Xu Yang

TL;DR

The paper addresses the challenge of long-form video question answering (LVQA) by converting arbitrary-length videos into scene-level screenplay representations through a multimodal perception pipeline, and by applying a Look Back mechanism to augment uncertain breakpoint answers with targeted visual frames. The core contributions are: (i) Scene-Level Script Generation that merges shots into coherent scenes using LLMs, (ii) a multimodal perception module that fuses visual captions, transcripts, and audio events, and (iii) the Look Back for Determination strategy to improve breakpoint-mode performance. The approach achieves state-of-the-art results on the MovieChat-1K LVQA benchmark, with global accuracy of 87.5% and breakpoint accuracy of 68.8%, demonstrating the effectiveness of narrative-level representations plus selective visual augmentation for long-form understanding. This work suggests a scalable pathway for LVQA by leveraging high-level storytelling and targeted visual evidence, enabling accurate answering without task-specific training on large datasets.
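The scene-level merging step in contribution (i) can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the `Shot` and `Scene` types, the function `merge_shots_into_scenes`, and the `same_scene` callback are hypothetical names. In the paper the merge decision is made by an LLM; here it is abstracted as a callable so the control flow can be shown without any model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Shot:
    start: float   # shot start time in seconds
    end: float     # shot end time in seconds
    caption: str   # fused description (visual caption, transcript, audio events)


@dataclass
class Scene:
    start: float
    end: float
    captions: List[str]


def merge_shots_into_scenes(
    shots: List[Shot],
    same_scene: Callable[[Scene, Shot], bool],
) -> List[Scene]:
    """Greedily merge consecutive shots into scenes.

    `same_scene` is a hypothetical stand-in for the LLM judgement of
    whether the next shot continues the current scene.
    """
    scenes: List[Scene] = []
    for shot in shots:
        if scenes and same_scene(scenes[-1], shot):
            # Extend the current scene with this shot.
            scenes[-1].end = shot.end
            scenes[-1].captions.append(shot.caption)
        else:
            # Start a new scene at this shot.
            scenes.append(Scene(shot.start, shot.end, [shot.caption]))
    return scenes
```

A toy predicate that compares a location prefix in the captions is enough to exercise the function; a real system would replace it with a model-backed decision.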

Abstract

The Long-form Video Question-Answering task requires the comprehension and analysis of extended video content to respond accurately to questions by utilizing both temporal and contextual information. In this paper, we present MM-Screenplayer, an advanced video understanding system with multi-modal perception capabilities that can convert any video into textual screenplay representations. Unlike previous storytelling methods, we organize video content into scenes as the basic unit, rather than just visually continuous shots. Additionally, we developed a "Look Back" strategy to reassess and validate uncertain information, particularly targeting breakpoint mode. MM-Screenplayer achieved the highest score in the CVPR'2024 LOng-form VidEo Understanding (LOVEU) Track 1 Challenge, with a global accuracy of 87.5% and a breakpoint accuracy of 68.8%.
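The "Look Back" idea can be sketched as a two-stage control flow: answer from the screenplay first, and consult frames only when that answer is uncertain in breakpoint mode. The sketch below is an assumption-laden illustration, not the published method: the function `answer_with_look_back`, its three callbacks, and the `window_s` parameter (how far before the breakpoint to sample frames) are all hypothetical and not taken from the paper.

```python
from typing import Callable, List, Optional, Tuple


def answer_with_look_back(
    question: str,
    screenplay: str,
    breakpoint_s: Optional[float],
    answer_from_text: Callable[[str, str], Tuple[str, bool]],
    sample_frames: Callable[[float, float], List[object]],
    answer_from_frames: Callable[[str, str, List[object]], str],
    window_s: float = 10.0,  # assumed look-back window, not from the paper
) -> str:
    """Answer from the screenplay; revisit frames only if uncertain.

    All three callbacks are hypothetical stand-ins for model calls:
    - answer_from_text returns (answer, is_certain)
    - sample_frames returns frames between two timestamps (seconds)
    - answer_from_frames re-answers with the frames as extra evidence
    """
    answer, is_certain = answer_from_text(question, screenplay)
    # Global mode (no breakpoint) or a confident answer: keep the text answer.
    if is_certain or breakpoint_s is None:
        return answer
    # Breakpoint mode with an uncertain answer: look back at nearby frames.
    start = max(0.0, breakpoint_s - window_s)
    frames = sample_frames(start, breakpoint_s)
    return answer_from_frames(question, screenplay, frames)
```

Keeping the visual pass conditional means frames are only processed for the questions that need them, which matches the summary's description of selective visual augmentation.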

Paper Structure (11 sections, 2 figures, 2 tables)

Figures (2)

  • Figure 1: The overall architecture of MM-Screenplayer.
  • Figure 2: Comparison of answers produced by MM-Screenplayer and other state-of-the-art methods for a question from the MovieChat-1K test set. Our method produced significantly better answers, while all other methods' answers were incorrect.