MovieBench: A Hierarchical Movie Level Dataset for Long Video Generation

Weijia Wu, Mingyu Liu, Zeyu Zhu, Xi Xia, Haoen Feng, Wen Wang, Kevin Qinghong Lin, Chunhua Shen, Mike Zheng Shou

TL;DR

MovieBench tackles the lack of benchmarks for long-form video generation by introducing a hierarchical movie-level dataset with scripts, character banks, scene breakdowns, and shot-level annotations. The approach enables script-to-movie generation with character consistency and audio synchronization across multiple scenes. The paper introduces novel character-consistency metrics and demonstrates through multiple experiments that current methods struggle with multi-character consistency, multi-view coherence, and synchronized audio. The dataset and analysis provide a foundation for progress toward coherent, multi-scene narrative video generation.

Abstract

Recent advancements in video generation models, like Stable Video Diffusion, show promising results, but primarily focus on short, single-scene videos. These models struggle with generating long videos that involve multiple scenes, coherent narratives, and consistent characters. Furthermore, there is no publicly available dataset tailored for the analysis, evaluation, and training of long video generation models. In this paper, we present MovieBench: A Hierarchical Movie-Level Dataset for Long Video Generation, which addresses these challenges through three contributions: (1) movie-length videos featuring rich, coherent storylines and multi-scene narratives, (2) consistency of character appearance and audio across scenes, and (3) a hierarchical data structure that contains high-level movie information and detailed shot-level descriptions. Experiments demonstrate that MovieBench brings new insights and challenges, such as maintaining character ID consistency across multiple scenes for various characters. The dataset will be public and continuously maintained, aiming to advance the field of long video generation. Data can be found at: https://weijiawu.github.io/MovieBench/.

Paper Structure

This paper contains 30 sections, 1 equation, 14 figures, 9 tables.

Figures (14)

  • Figure 1: Video Generation vs. Movie Generation. The text-to-video paradigm (e.g., MiraData; Ju et al., 2024) takes a text input without character information and generates a short video. In contrast, script-to-movie generation involves a complex storyline, requiring character consistency, plot progression, and audio synchronization.
  • Figure 2: MovieBench Dataset. MovieBench categorizes the movie annotations into three hierarchical data levels, representing different granularities of information: 1) Movie level provides a broad overview of the film; 2) Scene level provides mid-level scene consistency information; 3) Shot level emphasizes specific moments with detailed descriptions.
  • Figure 3: Annotation Generation for Shot-Level Video. Given the video, character banks, audio, and movie descriptions, a VLM can summarize the content, including characters and plot.
  • Figure 4: Character Frequency Statistics. The frequency of different characters varies significantly.
  • Figure 5: Distribution of Scenes. The number of scenes varies significantly across different movies, from $20$ to $350$ scenes.
  • ...and 9 more figures
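
The three annotation levels described in the Figure 2 caption (movie, scene, shot) can be pictured as a nested record. The sketch below is only an illustration under stated assumptions: the class names, field names, and the placement of the character bank at the movie level are hypothetical and are not taken from the released dataset schema. The helper `character_frequency` is likewise a hypothetical function that counts shot appearances per character, the kind of statistic summarized in Figure 4.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Shot:
    """Shot level: one specific moment with a detailed description."""
    description: str
    # Names of characters visible in this shot (hypothetical field).
    characters: List[str] = field(default_factory=list)


@dataclass
class Scene:
    """Scene level: mid-level information shared by a group of shots."""
    summary: str
    shots: List[Shot] = field(default_factory=list)


@dataclass
class Movie:
    """Movie level: a broad overview of the film."""
    title: str
    overview: str
    # Assumed layout: character name -> reference image path.
    character_bank: Dict[str, str] = field(default_factory=dict)
    scenes: List[Scene] = field(default_factory=list)


def character_frequency(movie: Movie) -> Dict[str, int]:
    """Count the number of shots in which each character appears."""
    counts: Dict[str, int] = {}
    for scene in movie.scenes:
        for shot in scene.shots:
            for name in shot.characters:
                counts[name] = counts.get(name, 0) + 1
    return counts
```

Nesting shots inside scenes inside a movie keeps the coarse context (overview, character bank) reachable from any shot, which is what a script-to-movie pipeline would need in order to condition each shot on shared character information.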