MSG score: A Comprehensive Evaluation for Multi-Scene Video Generation

Daewon Yoon, Hyungsuk Lee, Wonsik Shin

TL;DR

MSG tackles multi-scene video generation by decoupling frame-level and scene-level processing, using a Bidirectional Frame Reference (BFFR) for intra-scene fidelity and a Backward Scene Reference (BSR) for inter-scene transitions to enhance temporal consistency. The approach combines a multi-component loss with a structured implementation workflow to promote short-term detail preservation and long-term coherence, and it envisions a score-based evaluation benchmark to automate selection of high-quality outputs. The study situates MSG within diffusion-based text-to-video literature and reviews existing temporal-consistency strategies, while reporting that experimental validation was not yet conclusive, highlighting the need for further optimization. If successful, MSG could enable more reliable, long-horizon video generation from multi-prompt narratives with reduced manual curation and improved narrative continuity.

Abstract

This paper addresses the evaluation metrics needed for generating multi-scene videos that follow a continuous scenario, as opposed to traditional short video generation. Scenario-based videos require a comprehensive evaluation that considers multiple factors, such as character consistency, artistic coherence, aesthetic quality, and the alignment of the generated content with the intended prompt. Unlike single images, videos also involve character movement across frames, which can introduce distortion or unintended changes that must be evaluated and corrected. With probabilistic models such as diffusion, obtaining the desired scene requires repeated sampling and manual selection, akin to how a film director chooses the best shots from numerous takes. We propose a score-based evaluation benchmark that automates this process, enabling a more objective and efficient assessment of these factors. This approach allows high-quality multi-scene videos to be generated by selecting the best outcomes through automated scoring rather than manual inspection.
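The selection step described in the abstract (sample several candidates, score each, keep the best) can be illustrated with a minimal best-of-N sketch. This is not the paper's MSG score: the criterion names are borrowed from the abstract's list of factors, while the equal weights, the `combined_score` and `select_best` helpers, and the `scorer` callback are hypothetical placeholders chosen only for illustration.

```python
from typing import Callable, Dict, Sequence, TypeVar

T = TypeVar("T")

# Assumption: criterion names follow the abstract's list of factors;
# the equal weights are illustrative, not values from the paper.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "character_consistency": 0.25,
    "artistic_coherence": 0.25,
    "aesthetic_quality": 0.25,
    "prompt_alignment": 0.25,
}


def combined_score(
    scores: Dict[str, float],
    weights: Dict[str, float] = DEFAULT_WEIGHTS,
) -> float:
    """Weighted average of per-criterion scores (hypothetical aggregation)."""
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(weights[name] * scores[name] for name in weights) / total_weight


def select_best(
    candidates: Sequence[T],
    scorer: Callable[[T], Dict[str, float]],
    weights: Dict[str, float] = DEFAULT_WEIGHTS,
) -> T:
    """Return the candidate with the highest combined score.

    `scorer` stands in for whatever automated metrics are available;
    it maps one sampled video to a dict of per-criterion scores.
    """
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda c: combined_score(scorer(c), weights))
```

In use, `candidates` would hold the N videos sampled for one scene prompt and `scorer` would wrap the actual metric models, so the manual "pick the best take" step becomes a single `select_best` call per scene.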

Paper Structure

This paper contains 20 sections, 1 equation.