Event-Driven Video Generation

Chika Maduabuchi

Abstract

State-of-the-art text-to-video models often look realistic frame-by-frame yet fail on simple interactions: motion starts before contact, actions are not realized, objects drift after placement, and support relations break. We argue this stems from frame-first denoising, which updates latent state everywhere at every step without an explicit notion of when and where an interaction is active. We introduce Event-Driven Video Generation (EVD), a minimal DiT-compatible framework that makes sampling event-grounded: a lightweight event head predicts token-aligned event activity, event-grounded losses couple activity to state change during training, and event-gated sampling (with hysteresis and early-step scheduling) suppresses spurious updates while concentrating updates during interactions. On EVD-Bench, EVD consistently improves human preference and VBench dynamics, substantially reducing failure modes in state persistence, spatial accuracy, support relations, and contact stability without sacrificing appearance. These results indicate that explicit event grounding is a practical abstraction for reducing interaction hallucinations in video generation.
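The gating mechanism named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact rule, so everything below is an assumption made for illustration: the function names (`hysteresis_gate`, `gated_step`), the thresholds, the `floor` weight for inactive tokens, and the reading of "early-step scheduling" as applying the gate only during the first fraction of sampling steps. It is not the authors' implementation.

```python
import numpy as np

def hysteresis_gate(activity, prev_gate, on_thresh=0.6, off_thresh=0.4):
    """Per-token binary gate with hysteresis (thresholds are hypothetical).

    A token switches on when its predicted event activity reaches on_thresh,
    switches off when it falls to off_thresh or below, and otherwise keeps
    its previous state, which avoids flicker near a single threshold.
    """
    gate = prev_gate.copy()
    gate[activity >= on_thresh] = True
    gate[activity <= off_thresh] = False
    return gate

def gated_step(latent, proposed, gate, step, num_steps, gated_frac=0.5, floor=0.1):
    """Apply a denoising update, damped on inactive tokens during early steps.

    Assumption: gating is active only for the first gated_frac of the steps;
    later steps update every token in full.
    """
    delta = proposed - latent
    if step < int(gated_frac * num_steps):
        # Active tokens get the full update; inactive ones a small fraction.
        weight = np.where(gate, 1.0, floor)[:, None]
    else:
        weight = 1.0
    return latent + weight * delta

# Toy example: 3 tokens with 2 latent channels each.
activity = np.array([0.7, 0.5, 0.3])      # stand-in for event-head output
prev_gate = np.array([False, True, True])
gate = hysteresis_gate(activity, prev_gate)

latent = np.zeros((3, 2))
proposed = np.ones((3, 2))                # stand-in for a denoiser proposal
early = gated_step(latent, proposed, gate, step=0, num_steps=10)
late = gated_step(latent, proposed, gate, step=9, num_steps=10)
```

In this toy run the middle token (activity 0.5) stays on only because it was on before, which is the hysteresis effect; the third token is gated off and receives a damped update in the early step but a full update in the late step.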

Paper Structure (246 sections, 74 equations, 4 figures, 9 tables, 2 algorithms)

Figures (4)

  • Figure 1: Representative text-conditioned video outputs produced by EVD. Event-Driven Video Generation enforces causal state transitions, eliminating hallucinated motion and physically implausible interactions that persist in state-of-the-art video diffusion models.
  • Figure 2: Failure taxonomy of DiT-30B under simple physical interactions. Examples of systematic breakdowns in DiT-30B generations: (a) state persistence, e.g., post-interaction dynamics (the chair continues moving after the pulling action has ceased); (b) spatial accuracy, e.g., object placement (the cube fails to align with the intended platform); (c) support relations, e.g., event realization (the book appears stacked without a visible stacking action); and (d) contact stability, e.g., causal initiation (the plate begins moving before any hand-object contact occurs).
  • Figure 3: Representative text-conditioned video generations from EVD. EVD produces coherent event-driven dynamics across a diverse set of interactions, including target-directed motion, constrained mechanisms, deformation and recovery, gravity-mediated closure, multi-agent coordination, and liquid transfer. These examples illustrate that EVD captures causally grounded state transitions beyond simple frame-to-frame motion synthesis.
  • Figure 4: Qualitative comparison with leading video generation baselines. We compare EVD against Movie Gen, Sora, and DiT-30B on representative prompts involving soft-body deformation, flexible-object dynamics, structured scene interactions, and liquid transfer. Across all examples, baseline models often exhibit incomplete event realization, weak contact-response coupling, or implausible state evolution, whereas EVD produces temporally ordered interactions and more coherent physical outcomes.