Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping
Chao Zhou, Tianyi Wei, Yiling Chen, Wenbo Zhou, Nenghai Yu
TL;DR
This paper tackles the computational bottleneck in multi-condition Diffusion Transformers that arises from concatenation-based attention over many condition tokens. It introduces Position-aligned and Keyword-scoped Attention (PKA), which decomposes attention into Position-Aligned Attention (PAA) for spatial control and Keyword-Scoped Attention (KSA) for subject-driven control, and pairs it with a Conditional Sensitivity-Aware Sampling (CSAS) strategy that focuses training on the most impactful denoising phases. PAA achieves near-linear complexity by enforcing one-to-one spatial alignment, while KSA uses a semantic mask to prune nonessential cross-attention in subject regions. CSAS further accelerates convergence by biasing training toward early, high-sensitivity timesteps. Empirically, the approach yields up to 10x inference speedup and 5.1x VRAM reduction in the attention module, with competitive or superior image fidelity, controllability, and subject consistency across multiple multi-condition tasks, supporting its scalability and practicality for fine-grained multi-condition diffusion generation.
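To make the scaling argument concrete, the following back-of-the-envelope sketch counts query-key pairs under two accounting schemes. It is an illustration under stated assumptions, not the authors' implementation: the function names, the token counts, and the assumption that each image token reads exactly one aligned token per condition map are all hypothetical.

```python
def full_concat_pairs(n_text, n_image, n_conditions, tokens_per_condition):
    """Query-key pairs when all tokens are concatenated and attend to each other."""
    total = n_text + n_image + n_conditions * tokens_per_condition
    return total * total


def position_aligned_pairs(n_text, n_image, n_conditions):
    """Pairs under an ASSUMED position-aligned scheme.

    Text and image tokens still attend jointly; each image token additionally
    reads only its spatially aligned token from each condition map, so the
    extra cost grows linearly with the number of conditions.
    """
    base = (n_text + n_image) ** 2
    aligned = n_image * n_conditions
    return base + aligned


if __name__ == "__main__":
    # Hypothetical sizes: 512 text tokens, 1024 image tokens, 1024 tokens per condition.
    for k in (1, 2, 4):
        full = full_concat_pairs(512, 1024, k, 1024)
        aligned = position_aligned_pairs(512, 1024, k)
        print(f"conditions={k} full={full} aligned={aligned}")
```

Under this accounting, each extra condition adds a fixed number of pairs in the aligned scheme, whereas the concatenated scheme grows quadratically in the total token count. The real savings reported above depend on details this sketch does not model.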
Abstract
While modern text-to-image models excel at prompt-based generation, they often lack the fine-grained control necessary for specific user requirements like spatial layouts or subject appearances. Multi-condition control addresses this, yet its integration into Diffusion Transformers (DiTs) is bottlenecked by the conventional "concatenate-and-attend" strategy, which suffers from quadratic computational and memory overhead as the number of conditions scales. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. To this end, we propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies. Specifically, Position-Aligned Attention (PAA) linearizes spatial control by enforcing localized patch alignment, while Keyword-Scoped Attention (KSA) prunes irrelevant subject-driven interactions via semantic-aware masking. To facilitate efficient learning, we further introduce a Conditional Sensitivity-Aware Sampling (CSAS) strategy that reweights the training objective towards critical denoising phases, drastically accelerating convergence and enhancing conditional fidelity. Empirically, PKA delivers a 10.0$\times$ inference speedup and a 5.1$\times$ VRAM saving, providing a scalable and resource-friendly solution for high-fidelity multi-condition generation.
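As a rough illustration of what reweighting training towards particular denoising phases can look like, the sketch below draws timesteps from a skewed distribution instead of a uniform one. The text above does not specify the actual CSAS distribution, so everything here is an assumption: the name `sample_timestep`, the power-law density, the `bias` parameter, and the convention that t close to 1 corresponds to the noisiest (earliest) denoising steps.

```python
import random


def sample_timestep(rng, bias=2.0):
    """Draw t in [0, 1) with density bias * t**(bias - 1).

    Uses inverse-CDF sampling: the CDF is t**bias, so t = u**(1 / bias).
    bias > 1 skews samples towards t near 1 (ASSUMED to be the early,
    high-noise phase); bias == 1 recovers uniform sampling.
    """
    u = rng.random()
    return u ** (1.0 / bias)


if __name__ == "__main__":
    rng = random.Random(0)
    draws = [sample_timestep(rng, bias=2.0) for _ in range(10000)]
    share_late = sum(t > 0.5 for t in draws) / len(draws)
    print(f"share of draws with t > 0.5: {share_late:.2f}")
```

With `bias=2.0` the expected value of t is 2/3 rather than 1/2, so more gradient updates land in the assumed high-sensitivity range. A learned or empirically measured sensitivity curve could replace the power law without changing the sampling interface.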
