When and Where do Events Switch in Multi-Event Video Generation?

Ruotong Liao, Guowen Huang, Qing Cheng, Thomas Seidl, Daniel Cremers, Volker Tresp

TL;DR

This paper investigates when and where multi-event prompts steer diffusion-based text-to-video generation. It introduces MEve, a dual-event benchmark assembled from LLM-generated prompts, diagnostic-content prompts, and viewpoint-controlled prompts, and systematically evaluates CogVideoX and OpenSora models. The key finding is that exposing the second event within the first $\sim 0.3$ of the denoising steps and constraining the first event to shallow DiT blocks largely determine the global event semantics, while later steps and deeper blocks have limited capacity to introduce new events. The work shows that early denoising and shallow-layer conditioning are the dominant factors for multi-event transitions, and it points to practical directions for multi-event conditioning in diffusion-based video synthesis.

Abstract

Text-to-video (T2V) generation has surged, raising challenging questions, especially when a long video must depict multiple sequential events with temporal coherence and controllable content. Existing methods that extend to multi-event generation do not examine the intrinsic factors behind event shifting. The paper aims to answer the central question: when and where do multi-event prompts control event transitions during T2V generation? This work introduces MEve, a self-curated prompt suite for evaluating multi-event T2V generation, and conducts a systematic study of two representative model families, OpenSora and CogVideoX. Extensive experiments demonstrate the importance of early intervention in denoising steps and block-wise model layers, revealing the essential factor for multi-event video generation and highlighting the possibilities for multi-event conditioning in future models.

Paper Structure

This paper contains 24 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Probing the turning points of multi-event generation
  • Figure 2: Category distribution of the MEve dataset.
  • Figure 3: Experimental results on MEve for RQ1 (when) and RQ2 (where), shown in Columns 1–4 (RQ1) and Column 5 (RQ2). Curves are category-wise means over prompts; the X-axis is the fusion ratio (RQ1) or the block split ratio (RQ2), $x\in[0,1]$. Top row: Text Alignment (TA$\uparrow$). Middle row: Background Consistency (BC$\uparrow$). Bottom row: Identity Consistency (IC$\uparrow$). In the top (TA) row, each color pair represents a group of dual-event generations. Triangle markers denote $\text{P}_1$-related TA, and round markers denote $\text{P}_2$-related TA.
  • Figure 4: Comparison of models on the same set of prompts per category (Table \ref{tab:dataset_ratios}).
  • Figure 5: Qualitative visualizations.
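
The fusion ratio (RQ1) and block split ratio (RQ2) on the X-axis of Figure 3 both take a value $x\in[0,1]$. As a rough illustration only, the sketch below shows one way such a ratio could be turned into a per-step or per-block prompt assignment. It assumes that the first $x$ fraction of units is conditioned on $\text{P}_1$ and the remainder on $\text{P}_2$; this reading, the helper name `split_by_ratio`, and the step and block counts are all hypothetical and are not taken from the paper, whose actual protocol may differ.

```python
from typing import List


def split_by_ratio(num_units: int, ratio: float) -> List[str]:
    """Label the first round(ratio * num_units) units "P1" and the rest "P2".

    ASSUMPTION: this is only one possible reading of the fusion ratio (units =
    denoising steps) and the block split ratio (units = DiT blocks). It is not
    the paper's confirmed procedure.
    """
    if num_units < 0:
        raise ValueError("num_units must be non-negative")
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must lie in [0, 1]")
    cut = round(ratio * num_units)  # index of the first unit that sees P2
    return ["P1"] * cut + ["P2"] * (num_units - cut)


# Hypothetical settings, chosen purely for illustration.
step_prompts = split_by_ratio(num_units=50, ratio=0.3)   # RQ1: denoising steps
block_prompts = split_by_ratio(num_units=30, ratio=0.5)  # RQ2: DiT blocks

print("steps  P1/P2:", step_prompts.count("P1"), step_prompts.count("P2"))
print("blocks P1/P2:", block_prompts.count("P1"), block_prompts.count("P2"))
```

Under this assumed reading, sweeping `ratio` from 0 to 1 moves the switch point from "all units see $\text{P}_2$" to "all units see $\text{P}_1$", which is the kind of sweep the curves in Figure 3 are plotted against.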