Comp-Attn: Present-and-Align Attention for Compositional Video Generation

Hongyu Zhang, Yufan Deng, Shenghai Yuan, Yian Zhao, Peng Jin, Xuehan Hou, Chang Liu, Jie Chen

TL;DR

We address compositional text-to-video generation by separating subject presence from inter-subject relational alignment. Our training-free Comp-Attn introduces Subject-aware Condition Interpolation (SCI) to reinforce subject-level conditioning and Layout-forcing Attention Modulation (LAM) to align attention with LLM-planned layouts via IoU-guided modulation, enabling faithful multi-subject scenes. Empirical results on T2V-CompBench, VBench, and T2I-CompBench show substantial improvements in subject presence, spatial relations, and overall semantic quality across multiple backbones, at the cost of a modest increase in latency. The framework generalizes to T2I tasks and offers a scalable, plug-and-play solution for compositional video and image generation.

Abstract

In text-to-video (T2V) generation, reliably synthesizing compositional content involving multiple subjects with intricate relations remains underexplored. The main challenges are twofold: 1) subject presence, where some subjects fail to appear in the video; 2) inter-subject relations, where the interactions and spatial relationships between subjects are misaligned. Existing methods adopt techniques such as inference-time latent optimization or layout control, which fail to address both issues simultaneously. To tackle these problems, we propose Comp-Attn, a composition-aware cross-attention variant that follows a Present-and-Align paradigm: it decouples the two challenges by enforcing subject presence at the condition level and achieving relational alignment at the attention-distribution level. Specifically, 1) we introduce Subject-aware Condition Interpolation (SCI) to reinforce subject-specific conditions and ensure each subject's presence; 2) we propose Layout-forcing Attention Modulation (LAM), which dynamically steers the attention distribution to align with the relational layout of multiple subjects. Comp-Attn can be integrated into various T2V baselines in a training-free manner, boosting T2V-CompBench scores by 15.7% and 11.7% on Wan2.1-T2V-14B and Wan2.2-T2V-A14B, respectively, with only a 5% increase in inference time. It also achieves strong performance on VBench and T2I-CompBench, demonstrating its scalability to general video generation and compositional text-to-image (T2I) tasks.

Paper Structure

This paper contains 43 sections, 13 equations, 10 figures, 8 tables.

Figures (10)

  • Figure 1: We propose Comp-Attn, a novel "Present-and-Align" paradigm for compositional T2V generation. (a) Qualitative comparison: Comp-Attn effectively addresses both subject presence and inter-subject relation challenges. (b) Quantitative comparison: Comp-Attn achieves significant performance improvement on T2V-CompBench with good efficiency. (c) Motivation overview: Comp-Attn injects composition awareness into the condition and attention distribution of the cross-attention layer.
  • Figure 2: Detailed Architecture of Comp-Attn. Comp-Attn enhances compositional T2V performance at the cross-attention layer through a "Present-and-Align" paradigm. (1) Subject-aware Condition Interpolation (SCI) reinforces subject-specific semantics during conditioning, ensuring the presence of each subject. (2) Layout-forcing Attention Modulation (LAM) aligns fine-grained inter-subject relationships by modulating attention distributions.
  • Figure 3: Attention response analysis. At the 10% timestep, attention scores are averaged across heads. The attention response for "Elliptical mirror" is weak, while that for "Square window" is overly strong, leading to subject absence with Wan2.2-A14B. SCI balances attention responses by restoring the original semantics of subjects, ensuring the presence of each subject.
  • Figure 4: Attention distribution analysis. The incorrect spatial attention distribution in Wan2.2-A14B causes positional errors, while LAM adjusts it to accurately reflect inter-subject relationships. The colored boxes represent the LLM layout.
  • Figure 5: Qualitative comparison on generating compositional contents. Comp-Attn achieves superior performance in both subject presence and inter-subject relations, surpassing other compositional generation paradigms as well as powerful video foundation models.
  • ...and 5 more figures
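
The TL;DR describes LAM as aligning attention with LLM-planned layouts through IoU-guided modulation, and Figure 4 shows those layouts as boxes. This excerpt does not give the actual formulation, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes a single 2D cross-attention map for one subject, a normalized `(x0, y0, x1, y1)` layout box, a half-of-maximum threshold to define the "attended region", and a boost that grows as the IoU shrinks. The function names `box_mask`, `attention_box_iou`, and `modulate` are hypothetical.

```python
import numpy as np


def box_mask(h, w, box):
    """Boolean mask of a normalized (x0, y0, x1, y1) box on an h x w grid."""
    x0, y0, x1, y1 = box
    mask = np.zeros((h, w), dtype=bool)
    rows = slice(int(round(y0 * h)), int(round(y1 * h)))
    cols = slice(int(round(x0 * w)), int(round(x1 * w)))
    mask[rows, cols] = True
    return mask


def attention_box_iou(attn, box, rel_threshold=0.5):
    """IoU between the high-response region of `attn` and the layout box.

    Assumption: the high-response region is every cell whose score is at
    least `rel_threshold` times the map's maximum.
    """
    region = attn >= rel_threshold * attn.max()
    target = box_mask(attn.shape[0], attn.shape[1], box)
    union = np.logical_or(region, target).sum()
    if union == 0:
        return 0.0
    return float(np.logical_and(region, target).sum()) / float(union)


def modulate(attn, box, max_strength=1.0):
    """Boost attention inside the box; the boost grows as IoU drops.

    Assumption: a simple multiplicative boost followed by renormalization.
    Returns the modulated map (summing to 1) and the measured IoU.
    """
    iou = attention_box_iou(attn, box)
    strength = max_strength * (1.0 - iou)
    target = box_mask(attn.shape[0], attn.shape[1], box)
    out = attn * np.where(target, 1.0 + strength, 1.0)
    return out / out.sum(), iou


# Toy example: attention sits on the left half, the layout box on the right.
attn = np.full((8, 8), 0.01)
attn[:, :4] = 1.0
right_box = (0.5, 0.0, 1.0, 1.0)
modulated, iou = modulate(attn, right_box)
```

In this toy case the attended region and the box do not overlap, so the boost is at its maximum and the share of attention inside the box rises after renormalization. When the map already matches the box, the IoU is 1 and the map is left unchanged apart from normalization.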