Rethinking Multi-Condition DiTs: Eliminating Redundant Attention via Position-Alignment and Keyword-Scoping

Chao Zhou, Tianyi Wei, Yiling Chen, Wenbo Zhou, Nenghai Yu

TL;DR

This paper tackles the computational bottleneck of multi-condition Diffusion Transformers arising from the concatenation-based attention over many condition tokens. It introduces Position-aligned and Keyword-scoped Attention (PKA), decomposing attention into Position-Aligned Attention (PAA) for spatial control and Keyword-Scoped Attention (KSA) for subject-driven control, combined with a Conditional Sensitivity-Aware Sampling (CSAS) strategy to focus training on the most impactful denoising phases. PAA achieves near-linear complexity by enforcing one-to-one spatial alignment, while KSA employs a semantic mask to prune nonessential cross-attention in subject regions. CSAS further accelerates convergence by biasing training toward early, high-sensitivity timesteps. Empirically, the approach yields up to 10x inference speedup and 5.1x VRAM reduction in the attention module, with competitive or superior image fidelity, controllability, and subject consistency across multiple multi-condition tasks, validating its scalability and practicality for fine-grained multi-condition diffusion generation.

Abstract

While modern text-to-image models excel at prompt-based generation, they often lack the fine-grained control necessary for specific user requirements like spatial layouts or subject appearances. Multi-condition control addresses this, yet its integration into Diffusion Transformers (DiTs) is bottlenecked by the conventional "concatenate-and-attend" strategy, which suffers from quadratic computational and memory overhead as the number of conditions scales. Our analysis reveals that much of this cross-modal interaction is spatially or semantically redundant. To this end, we propose Position-aligned and Keyword-scoped Attention (PKA), a highly efficient framework designed to eliminate these redundancies. Specifically, Position-Aligned Attention (PAA) linearizes spatial control by enforcing localized patch alignment, while Keyword-Scoped Attention (KSA) prunes irrelevant subject-driven interactions via semantic-aware masking. To facilitate efficient learning, we further introduce a Conditional Sensitivity-Aware Sampling (CSAS) strategy that reweights the training objective towards critical denoising phases, drastically accelerating convergence and enhancing conditional fidelity. Empirically, PKA delivers a 10.0$\times$ inference speedup and a 5.1$\times$ VRAM saving, providing a scalable and resource-friendly solution for high-fidelity multi-conditioned generation.
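To make the quadratic-versus-linear contrast concrete, the following is a minimal counting sketch, not the authors' implementation. It assumes each spatial condition contributes as many tokens as the noisy image, ignores text tokens, and treats condition self-attention as cached and therefore excluded from the per-step count. The function names and the token count are hypothetical.

```python
def concat_pairs(n_image: int, n_conditions: int) -> int:
    """Query-key pairs when image and condition tokens share one full attention.

    Assumption: every condition has n_image tokens, so the joint sequence
    has n_image * (1 + n_conditions) tokens and attention is all-to-all.
    """
    total_tokens = n_image * (1 + n_conditions)
    return total_tokens * total_tokens


def position_aligned_pairs(n_image: int, n_conditions: int) -> int:
    """Query-key pairs when each image token sees only the aligned condition token.

    Image self-attention stays all-to-all (n_image ** 2); each condition adds
    one extra pair per image token, so the condition cost grows linearly.
    """
    return n_image * n_image + n_conditions * n_image


if __name__ == "__main__":
    n_image = 1024  # hypothetical token count, e.g. a 32 x 32 patch grid
    for n_conditions in (1, 2, 4):
        full = concat_pairs(n_image, n_conditions)
        aligned = position_aligned_pairs(n_image, n_conditions)
        print(n_conditions, full, aligned, round(full / aligned, 2))
```

Under these assumptions the concatenated count grows with the square of the number of conditions, while the position-aligned count grows only by one term per condition. The ratio printed here is a pair count, not a measured speedup, and should not be read as reproducing the reported figures.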

Paper Structure (21 sections, 5 equations, 14 figures, 1 table)

Figures (14)

  • Figure 1: Visual results of our proposed PKA on multi-conditional generation. Our proposed PKA achieves high-quality multi-conditional generation with remarkable efficiency. Zoom in for better visualization.
  • Figure 2: Visualization of the attention matrix in spatial-aligned generation. The heatmap exhibits a striking diagonal-dominant pattern, demonstrating that attention activations are primarily constrained to spatially congruent or proximal patches.
  • Figure 3: Attention maps in subject-driven generation. Prompt: "On the beach, a lady wearing this shirt sits under a beach umbrella." X is the noisy image.
  • Figure 4: Overview of our method. (a) The denoising framework. Full computation occurs only at the first step; the Keys and Values of all condition tokens are then cached for subsequent steps. (b) Position-aligned and Keyword-scoped Attention. In our decomposed attention mechanism, condition tokens attend only to themselves, which enables the KV cache. The noisy image tokens (X) then interact with spatial (SP) and subject (SJ) conditions via PAA and KSA, respectively. (c) Position-Aligned Attention (PAA). PAA performs efficient one-to-one attention between the image (X) and spatial condition (SP) tokens at their aligned positions. (d) Keyword-Scoped Attention (KSA). KSA computes a relevance mask from text keywords in one step. This mask is then applied in subsequent steps to confine the attention computation between the image (X) and subject (SJ) to only the most relevant regions.
  • Figure 5: Qualitative results of visual condition perturbation. Left to right: visual condition, 0 (no perturbation), 7, 14, 21 perturbation steps, and 28 steps (no visual condition). Zoom in for better visualization.
  • ...and 9 more figures
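
The keyword-scoped masking described in the Figure 4 caption can be illustrated with a small sketch. This is an illustration under stated assumptions, not the paper's procedure: the selection rule (keep a fixed fraction of image tokens with the highest keyword attention), the `keep_ratio` parameter, and both function names are hypothetical, since the summary does not specify how the mask is thresholded.

```python
import math


def keyword_mask(keyword_attention: list[float], keep_ratio: float) -> list[bool]:
    """Mark the image tokens most relevant to the subject keywords.

    keyword_attention[i] is the attention mass image token i assigns to the
    keyword text tokens, measured once. Assumption: the top keep_ratio
    fraction of tokens is kept; the real rule may differ.
    """
    n_tokens = len(keyword_attention)
    n_keep = max(1, math.ceil(n_tokens * keep_ratio))
    ranked = sorted(range(n_tokens), key=lambda i: keyword_attention[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [i in kept for i in range(n_tokens)]


def scoped_pairs(mask: list[bool], n_subject_tokens: int) -> int:
    """Image-to-subject query-key pairs left after applying the mask."""
    return sum(mask) * n_subject_tokens


if __name__ == "__main__":
    # Hypothetical keyword attention for six image tokens.
    scores = [0.01, 0.40, 0.35, 0.02, 0.20, 0.02]
    mask = keyword_mask(scores, keep_ratio=0.5)
    print(mask)
    print(scoped_pairs(mask, n_subject_tokens=10), "of", len(scores) * 10, "pairs kept")
```

Because the mask is computed once and reused, later denoising steps only pay for the image tokens that fall inside the scoped region.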