Comp-Attn: Present-and-Align Attention for Compositional Video Generation

Hongyu Zhang; Yufan Deng; Shenghai Yuan; Yian Zhao; Peng Jin; Xuehan Hou; Chang Liu; Jie Chen

TL;DR

We address compositional text-to-video generation by separating subject presence from inter-subject relational alignment. Comp-Attn is training-free: Subject-aware Condition Interpolation (SCI) reinforces subject-level conditioning, and Layout-forcing Attention Modulation (LAM) aligns attention with LLM-planned layouts via IoU-guided modulation, enabling faithful multi-subject scenes with low overhead. Results on T2V-CompBench, VBench, and T2I-CompBench show substantial improvements in subject presence, spatial relations, and overall semantic quality across multiple backbones, with only a modest increase in latency. The framework also generalizes to T2I tasks, offering a plug-and-play approach to compositional video and image generation.

Abstract

In text-to-video (T2V) generation, reliably synthesizing compositional content that involves multiple subjects with intricate relations remains underexplored. The main challenges are twofold: 1) subject presence, where not all subjects appear in the video; 2) inter-subject relations, where the interactions and spatial relationships between subjects are misaligned with the prompt. Existing methods rely on techniques such as inference-time latent optimization or layout control, which fail to address both issues simultaneously. To tackle these problems, we propose Comp-Attn, a composition-aware cross-attention variant that follows a Present-and-Align paradigm: it decouples the two challenges by enforcing subject presence at the condition level and achieving relational alignment at the attention-distribution level. Specifically, 1) we introduce Subject-aware Condition Interpolation (SCI) to reinforce subject-specific conditions and ensure each subject's presence; 2) we propose Layout-forcing Attention Modulation (LAM), which dynamically steers the attention distribution to align with the relational layout of multiple subjects. Comp-Attn can be seamlessly integrated into various T2V baselines in a training-free manner, boosting T2V-CompBench scores by 15.7% and 11.7% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B, respectively, with only a 5% increase in inference time. It also achieves strong performance on VBench and T2I-CompBench, demonstrating its scalability to general video generation and compositional text-to-image (T2I) tasks.
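The abstract does not spell out how the attention distribution is steered toward a layout, so the following is only an illustrative sketch of one way IoU-guided modulation could work, not the authors' LAM. It assumes a single subject's cross-attention map on a 2D grid and a layout box given as end-exclusive `(row0, col0, row1, col1)` coordinates. The names `box_mask`, `attention_iou`, and `modulate`, the 0.5 threshold, and the multiplicative boost are all hypothetical choices made for this example.

```python
from typing import List, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]  # (row0, col0, row1, col1), end-exclusive (assumed format)


def box_mask(height: int, width: int, box: Box) -> List[List[bool]]:
    """Binary mask that is True inside the layout box."""
    r0, c0, r1, c1 = box
    return [[r0 <= r < r1 and c0 <= c < c1 for c in range(width)]
            for r in range(height)]


def attention_iou(attn: Grid, box: Box, threshold: float = 0.5) -> float:
    """IoU between the thresholded (peak-normalized) attention map and the box."""
    height, width = len(attn), len(attn[0])
    peak = max(max(row) for row in attn)
    if peak <= 0:
        return 0.0
    mask = box_mask(height, width, box)
    inter = union = 0
    for r in range(height):
        for c in range(width):
            active = attn[r][c] / peak >= threshold
            inter += int(active and mask[r][c])
            union += int(active or mask[r][c])
    return inter / union if union else 0.0


def modulate(attn: Grid, box: Box, max_strength: float = 1.0) -> Grid:
    """Boost attention inside the box, more strongly when IoU is low, then renormalize.

    Hypothetical rule: strength = max_strength * (1 - IoU).
    """
    strength = max_strength * (1.0 - attention_iou(attn, box))
    mask = box_mask(len(attn), len(attn[0]), box)
    boosted = [[a * (1.0 + strength) if m else a for a, m in zip(arow, mrow)]
               for arow, mrow in zip(attn, mask)]
    total = sum(sum(row) for row in boosted)
    if total == 0:
        return [row[:] for row in attn]  # nothing to renormalize
    return [[a / total for a in row] for row in boosted]
```

In this sketch, a subject whose attention already overlaps its box is left nearly unchanged, while a subject attending to the wrong region has the attention mass inside its box scaled up. A multiplicative boost cannot recover a region where attention is exactly zero, so a practical variant would likely need an additive term or a different rule.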
