When and Where do Events Switch in Multi-Event Video Generation?
Ruotong Liao, Guowen Huang, Qing Cheng, Thomas Seidl, Daniel Cremers, Volker Tresp
TL;DR
This paper investigates when and where multi-event prompts steer diffusion-based text-to-video generation. It introduces MEve, a dual-event benchmark assembled from LLM-generated prompts, diagnostic-content prompts, and viewpoint-controlled prompts, and systematically evaluates CogVideoX and OpenSora models. The key finding is that global event semantics are largely determined by two factors: exposing the second event within roughly the first $\sim 0.3$ of denoising steps, and constraining the first event to shallow DiT blocks. Later steps and deeper blocks have limited capacity to introduce new events. The results identify early denoising and shallow-layer conditioning as the dominant factors in multi-event transitions and point to practical directions for multi-event conditioning in future diffusion-based video synthesis.
Abstract
Text-to-video (T2V) generation has surged, raising challenging questions, especially when a long video must depict multiple sequential events with temporal coherence and controllable content. Existing methods that extend to multi-event generation do not examine the intrinsic factors behind event shifting. This paper aims to answer the central question: when and where do multi-event prompts control event transitions during T2V generation? It introduces MEve, a self-curated prompt suite for evaluating multi-event T2V generation, and conducts a systematic study of two representative model families, OpenSora and CogVideoX. Extensive experiments demonstrate the importance of early intervention in both the denoising steps and the block-wise model layers, revealing the essential factors for multi-event video generation and highlighting possibilities for multi-event conditioning in future models.
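To make the "when" (denoising step) and "where" (DiT block) axes concrete, the following is a minimal sketch of a prompt-switching schedule over a step-by-block grid. It is an illustration under stated assumptions, not the paper's implementation: the class `SwitchSchedule`, its fields, and the example values of 50 steps and 30 blocks are hypothetical, and only the threshold of roughly 0.3 comes from the TL;DR above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SwitchSchedule:
    """Hypothetical schedule deciding which event prompt conditions each cell."""
    num_steps: int              # total denoising steps (assumed value in the example)
    num_blocks: int             # total DiT blocks (assumed value in the example)
    switch_step: int            # first step at which the second event prompt is visible
    first_event_max_block: int  # first event prompt is used only in blocks below this index

    def prompt_index(self, step: int, block: int) -> int:
        """Return 0 (first event) or 1 (second event) for a given step and block."""
        if step < self.switch_step:
            return 0  # the second event has not been exposed yet
        # After the switch, shallow blocks keep the first event,
        # deeper blocks receive the second event.
        return 0 if block < self.first_event_max_block else 1

    def exposes_second_event_early(self, early_frac: float = 0.3) -> bool:
        """Check whether the switch falls inside the early denoising window."""
        return self.switch_step < early_frac * self.num_steps


# Example: 50 steps, 30 blocks, switch at step 10, first event in blocks 0..7.
schedule = SwitchSchedule(num_steps=50, num_blocks=30,
                          switch_step=10, first_event_max_block=8)
grid = [[schedule.prompt_index(s, b) for b in range(schedule.num_blocks)]
        for s in range(schedule.num_steps)]
```

In this sketch, `grid[s][b]` records which event prompt would condition block `b` at step `s`. Sweeping `switch_step` and `first_event_max_block` is one way such a study could probe the two axes separately.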
