Delta-K: Boosting Multi-Instance Generation via Cross-Attention Augmentation

Zitong Wang, Zijun Shen, Haohao Xu, Zhengjie Luo, Weibin Wu

TL;DR

Delta-K is a backbone-agnostic, plug-and-play inference framework that tackles concept omission by operating directly in the shared cross-attention Key space: it extracts a differential key that encodes the semantic signature of missing concepts and injects it during the diffusion process.

Abstract

While diffusion models excel at text-to-image synthesis, they often suffer from concept omission when synthesizing complex multi-instance scenes. Existing training-free methods attempt to resolve this by rescaling attention maps, which merely amplifies unstructured noise without establishing coherent semantic representations. To address this, we propose Delta-K, a backbone-agnostic and plug-and-play inference framework that tackles omission by operating directly in the shared cross-attention Key space. Specifically, using a vision-language model (VLM), we extract a differential key $\Delta K$ that encodes the semantic signature of missing concepts. This signal is then injected during the early semantic planning stage of the diffusion process. Governed by a dynamically optimized scheduling mechanism, Delta-K grounds diffuse noise into stable structural anchors while preserving existing concepts. Extensive experiments demonstrate the generality of our approach: Delta-K consistently improves compositional alignment across both modern DiT models and classical U-Net architectures, without requiring spatial masks, additional training, or architectural modifications.
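The mechanism described in the abstract can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: it assumes $\Delta K$ is the per-token difference between the keys of the original prompt and those of a prompt with the missing concept masked, and that injection takes the form $K + \alpha_t \Delta K$. The function names, the linear-decay schedule `alpha`, and all numbers are hypothetical; the paper's actual scheduler is described as dynamically optimized.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product attention weights of one query over all key tokens."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


def differential_key(keys_full, keys_masked):
    """Assumed form of Delta K: keys(original prompt) - keys(masked prompt)."""
    return [[a - b for a, b in zip(kf, km)]
            for kf, km in zip(keys_full, keys_masked)]


def alpha(step, early_steps=15, alpha_max=1.0):
    """Hypothetical schedule: linear decay to zero over the early steps."""
    if step >= early_steps:
        return 0.0
    return alpha_max * (1.0 - step / early_steps)


def inject(keys, delta_k, strength):
    """Assumed injection rule: K' = K + strength * Delta K."""
    return [[k + strength * d for k, d in zip(key_row, delta_row)]
            for key_row, delta_row in zip(keys, delta_k)]


# Toy prompt with two tokens: index 0 is a present concept, index 1 is missing.
keys_full = [[1.0, 0.0], [0.0, 0.5]]
keys_masked = [[1.0, 0.0], [0.0, 0.0]]  # missing concept masked out
query = [0.5, 1.0]                       # one image-patch query

delta_k = differential_key(keys_full, keys_masked)
baseline = attention_weights(query, keys_full)
boosted = attention_weights(query, inject(keys_full, delta_k, alpha(0)))
print("baseline weights:", baseline)
print("boosted weights: ", boosted)
```

In this toy setup the differential key is zero for the present token, so its key is left untouched, while the missing token's share of the attention grows only during the early steps where `alpha` is non-zero.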

Paper Structure (46 sections, 2 theorems, 24 equations, 8 figures, 8 tables, 1 algorithm)

Key Result

Theorem 1

Let the cross-attention dimension be $d_k$. Let $Q_{\text{present}} \in \mathbb{R}^{d_k}$ denote the query vector associated with a successfully generated concept, and $\Delta K \in \mathbb{R}^{d_k}$ denote the differential key vector extracted through the VLM-guided masking procedure. Assume that emb… where $c > 0$ is a constant determined by the sub-Gaussian norm of the embeddings.
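As background for the setting of Theorem 1, the following sketch illustrates a general, well-known phenomenon: independent random vectors in $\mathbb{R}^{d_k}$ become nearly orthogonal as $d_k$ grows. It does not reproduce the paper's bound or its constant $c$. The use of standard Gaussian vectors as stand-ins for $Q_{\text{present}}$ and $\Delta K$, and the helper names, are assumptions made purely for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_abs_cosine(dim, trials=200, seed=0):
    """Average |cos| between pairs of independent Gaussian vectors."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    total = 0.0
    for _ in range(trials):
        u = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(u, v))
    return total / trials


for dim in (16, 128, 1024):
    print(dim, round(mean_abs_cosine(dim), 4))
```

The average absolute cosine shrinks as the dimension increases, which is the intuition behind expecting an added key direction to interfere little with queries of unrelated, already-generated concepts.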

Figures (8)

  • Figure 1: Overview of Delta-K. A VLM first separates present and missing concepts from a baseline generation. By contrasting the original and masked prompts, we obtain a differential key vector $\Delta K$, which is dynamically injected into cross-attention keys during sampling to reinforce missing concepts while preserving existing content.
  • Figure 2: Spatiotemporal dynamics of attention in SD3.5. (a) Missing concepts suffer from chronic intensity suppression but follow valid temporal trends. (b) The high early AUC identifies a semantic planning phase for intervention before image structure solidifies. (c) High instability (CV) characterizes missing tokens as unstable noise.
  • Figure 3: Examples of Delta-K. Compared with the baselines SDXL (Podell et al.), SD-2.1, Nano banana, and DALL-E 3 (Betker et al., 2023), our approach recovers instances that the baselines omit.
  • Figure 4: Qualitative rectification and cross-attention dynamics. Top: Delta-K successfully recovers omitted instances across both SDXL and SD3.5. Bottom: Cross-attention heatmaps for the SDXL example. In the baseline, the attention map for the missing token ("white dog") is scattered and noisy. Delta-K successfully focuses this attention into a highly localized region. Importantly, the attention for the present token ("black dog") remains nearly unchanged, which demonstrates that Delta-K achieves targeted augmentation without interfering with present tokens.
  • Figure 5: Quantitative analysis of temporal dynamics and adaptive scheduling. (a) Evolution of Attention Weight: Delta-K increases the mean attention activation of the missing token, resolving the suppression observed in the baseline. (b) $CV$ and $\alpha_t$ Trajectory: The dynamic scheduler concentrates the intervention strength $\alpha_t$ during the early generation steps. This early injection significantly reduces spatial instability $CV$ compared to the baseline. By lowering the $CV$ while the image layout forms, Delta-K successfully focuses the scattered attention into a stable region.
  • ...and 3 more figures
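
The instability measure $CV$ used in Figures 2 and 5 can be illustrated with a minimal sketch, assuming it denotes the standard coefficient of variation (standard deviation divided by mean). Exactly which attention quantity the paper computes it over is not stated in these captions, so the two series below are hypothetical per-step attention readings chosen only to contrast a steady signal with an erratic one.

```python
import math


def coefficient_of_variation(values):
    """Population standard deviation divided by the mean."""
    n = len(values)
    mean = sum(values) / n
    if mean == 0:
        raise ValueError("coefficient of variation is undefined for zero mean")
    variance = sum((v - mean) ** 2 for v in values) / n
    return math.sqrt(variance) / mean


# Hypothetical attention readings for two tokens (illustrative values only).
steady_token = [0.30, 0.31, 0.29, 0.30]
erratic_token = [0.05, 0.40, 0.10, 0.35]

print("steady CV: ", coefficient_of_variation(steady_token))
print("erratic CV:", coefficient_of_variation(erratic_token))
```

Because the measure is normalized by the mean, it separates instability from raw intensity: a weak but steady signal scores low, while a signal that jumps around scores high regardless of its average level.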

Theorems & Definitions (4)

  • Theorem 1: Semantic Orthogonality Bound
  • proof
  • Theorem 2: Attention Mass Concentration
  • proof