DyST-XL: Dynamic Layout Planning and Content Control for Compositional Text-to-Video Generation
Weijie He, Mushui Liu, Yunlong Yu, Zhao Wang, Chao Wu
TL;DR
DyST-XL addresses compositional text-to-video generation with a training-free framework that augments pre-trained diffusion models with three mechanisms: a Dynamic Layout Planner that uses LLM-based prompt parsing to generate physics-aware trajectories, a Dual-Prompt Controlled Attention system that enforces localized text-video alignment via frame-aware masking, and an Entity-Consistency Constraint strategy that propagates first-frame features to maintain object identity across frames. The approach achieves state-of-the-art results on the T2V-CompBench benchmark, outperforming both diffusion U-Net–based and DiT-based baselines across multiple metrics without any additional training. Key contributions include a unified, non-parametric layout-to-motion design, fine-grained attention control for multi-entity prompts, and temporal feature propagation to preserve identity, which together enable robust generation of complex dynamic scenes with multiple interacting entities. The work shows that training-free video synthesis can deliver improved fidelity and coherence, suggesting a scalable path for rapid iteration in multi-entity video storytelling.
Abstract
Compositional text-to-video generation, which requires synthesizing dynamic scenes with multiple interacting entities and precise spatial-temporal relationships, remains a critical challenge for diffusion-based models. Existing methods struggle with layout discontinuity, entity identity drift, and implausible interaction dynamics due to unconstrained cross-attention mechanisms and inadequate physics-aware reasoning. To address these limitations, we propose DyST-XL, a training-free framework that enhances off-the-shelf text-to-video models (e.g., CogVideoX-5B) through frame-aware control. DyST-XL integrates three key innovations: (1) a Dynamic Layout Planner that leverages large language models (LLMs) to parse input prompts into entity-attribute graphs and generate physics-aware keyframe layouts, with intermediate frames interpolated via trajectory optimization; (2) a Dual-Prompt Controlled Attention Mechanism that enforces localized text-video alignment through frame-aware attention masking, achieving precise control over individual entities; and (3) an Entity-Consistency Constraint strategy that propagates first-frame feature embeddings to subsequent frames during denoising, preserving object identity without manual annotation. Experiments demonstrate that DyST-XL excels in compositional text-to-video generation, significantly improving performance on complex prompts and bridging a crucial gap in training-free video synthesis. The code is available at https://github.com/XiaoBuL/DyST-XL.
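
To make the keyframe-to-mask idea concrete, the following is a minimal sketch and not the released implementation. It assumes each entity's layout is a normalized bounding box given at a few keyframes, substitutes plain linear interpolation for the trajectory optimization described above, and rasterizes each per-frame box into a binary grid of the kind a frame-aware attention mask could be built from. The names `interpolate_boxes` and `box_to_mask`, the box format, and the grid size are all hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed box format: (x0, y0, x1, y1), normalized to [0, 1].
Box = Tuple[float, float, float, float]


def interpolate_boxes(keyframes: Dict[int, Box], num_frames: int) -> List[Box]:
    """Fill in one entity's box for every frame from sparse keyframes.

    Linear interpolation is a stand-in here; the paper describes
    trajectory optimization, which this sketch does not implement.
    """
    if not keyframes:
        raise ValueError("at least one keyframe is required")
    idxs = sorted(keyframes)
    boxes: List[Box] = []
    for f in range(num_frames):
        if f <= idxs[0]:
            # Before the first keyframe: hold the first box.
            boxes.append(keyframes[idxs[0]])
        elif f >= idxs[-1]:
            # After the last keyframe: hold the last box.
            boxes.append(keyframes[idxs[-1]])
        else:
            lo = max(i for i in idxs if i <= f)
            hi = min(i for i in idxs if i >= f)
            if lo == hi:
                boxes.append(keyframes[lo])
            else:
                t = (f - lo) / (hi - lo)
                a, b = keyframes[lo], keyframes[hi]
                boxes.append(tuple(p + t * (q - p) for p, q in zip(a, b)))
    return boxes


def box_to_mask(box: Box, height: int, width: int) -> List[List[int]]:
    """Rasterize a normalized box onto a height x width grid.

    A cell is 1 when its center lies inside the box, else 0.
    """
    x0, y0, x1, y1 = box
    return [
        [
            1 if (x0 <= (c + 0.5) / width < x1 and y0 <= (r + 0.5) / height < y1) else 0
            for c in range(width)
        ]
        for r in range(height)
    ]


# Hypothetical entity moving from the top-left to the bottom-right quadrant.
keyframes = {0: (0.0, 0.0, 0.5, 0.5), 4: (0.5, 0.5, 1.0, 1.0)}
boxes = interpolate_boxes(keyframes, num_frames=5)
masks = [box_to_mask(b, height=4, width=4) for b in boxes]
```

In a masked-attention setting, one such grid per frame and per entity would typically be flattened and used to restrict which video tokens attend to that entity's prompt tokens; how DyST-XL wires this into the model is not specified in the abstract and is not shown here.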
