AID: Attention Interpolation of Text-to-Image Diffusion

Qiyuan He, Jinghao Wang, Ziwei Liu, Angela Yao

TL;DR

This work tackles conditional interpolation in text-to-image diffusion models, where multiple text conditions must blend smoothly and coherently. It introduces AID, a training-free framework that improves interpolation by (i) applying fused inner/outer interpolated attention to both cross- and self-attention, (ii) using a Beta-distribution prior to sample interpolation points non-uniformly for smoother transitions, and (iii) offering a prompt-guided variant, PAID, that steers the interpolation path with a user-provided prompt. The approach improves consistency, smoothness, and fidelity across benchmarks and downstream tasks, including image editing control and compositional generation, without any model training. The key contributions are a formal analysis of why text embedding interpolation (TEI) fails, a practical fused-attention interpolation mechanism, a Beta-prior sampling strategy, and a prompt-guided extension enabling explicit path control, all evaluated through quantitative and human studies.
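The Beta-prior idea in (ii) can be illustrated with a small sketch. This is one plausible reading, not the authors' implementation: uniformly spaced levels are mapped through the quantile function of a Beta distribution, so that the interpolation coefficients are spaced non-uniformly. The function name `beta_spaced_coefficients`, the default `alpha` and `beta` values, and the grid-based numerical inversion are all assumptions made for illustration.

```python
import numpy as np

def beta_spaced_coefficients(m, alpha=2.0, beta=2.0, grid=10001):
    """Return m coefficients in [0, 1] placed at Beta(alpha, beta) quantiles.

    Hypothetical helper: assumes alpha >= 1 and beta >= 1 so the density
    is finite at the endpoints of [0, 1].
    """
    t = np.linspace(0.0, 1.0, grid)
    # Unnormalised Beta density on the grid.
    pdf = t ** (alpha - 1.0) * (1.0 - t) ** (beta - 1.0)
    # Cumulative trapezoid integral, then normalise so that cdf[-1] == 1.
    steps = (pdf[1:] + pdf[:-1]) / 2.0 * np.diff(t)
    cdf = np.concatenate([[0.0], np.cumsum(steps)])
    cdf /= cdf[-1]
    # Invert the CDF at uniformly spaced levels (numerical quantile function).
    levels = np.linspace(0.0, 1.0, m)
    return np.interp(levels, cdf, t)

uniform = beta_spaced_coefficients(5, alpha=1.0, beta=1.0)
clustered = beta_spaced_coefficients(5, alpha=2.0, beta=2.0)
print(uniform)
print(clustered)
```

With `alpha = beta = 1` the Beta distribution is uniform and the sketch falls back to evenly spaced coefficients; larger values pull the interior coefficients toward the middle of the interval. Which shape gives the smoothest image sequence is an empirical question that this sketch does not answer.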

Abstract

Conditional diffusion models can create unseen images in various settings, aiding image interpolation. Interpolation in latent spaces is well-studied, but interpolation with specific conditions like text or poses is less understood. Simple approaches, such as linear interpolation in the space of conditions, often result in images that lack consistency, smoothness, and fidelity. To that end, we introduce a novel training-free technique named Attention Interpolation via Diffusion (AID). Our key contributions include 1) proposing an inner/outer interpolated attention layer; 2) fusing the interpolated attention with self-attention to boost fidelity; and 3) applying a Beta distribution to the selection of interpolation coefficients to increase smoothness. We also present a variant, Prompt-guided Attention Interpolation via Diffusion (PAID), that considers interpolation as a condition-dependent generative process. This method enables the creation of new images with greater consistency, smoothness, and efficiency, and offers control over the exact path of interpolation. Our approach demonstrates effectiveness for conceptual and spatial interpolation. Code and demo are available at https://github.com/QY-H00/attention-interpolation-diffusion.
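The inner/outer interpolated attention and its fusion with self-attention can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the released code: "inner" is taken to mean interpolating keys and values before attention, "outer" to mean interpolating the two attention outputs, and "fused" to mean concatenating the interpolated image's own keys and values with those of each endpoint. All function names, shapes, and the scaled dot-product form are assumptions.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Standard scaled dot-product attention.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def inner_interpolated_attention(Q, K_1, V_1, K_m, V_m, w):
    # Assumed "inner" variant: interpolate keys/values, then attend once.
    K_bar = (1.0 - w) * K_1 + w * K_m
    V_bar = (1.0 - w) * V_1 + w * V_m
    return attention(Q, K_bar, V_bar)

def outer_interpolated_attention(Q, K_1, V_1, K_m, V_m, w):
    # Assumed "outer" variant: attend to each endpoint, then interpolate outputs.
    return (1.0 - w) * attention(Q, K_1, V_1) + w * attention(Q, K_m, V_m)

def fused_outer_attention(Q, K_self, V_self, K_1, V_1, K_m, V_m, w):
    # Assumed fusion: prepend the interpolated image's own keys/values
    # to each endpoint's keys/values before attending.
    a_1 = attention(Q, np.concatenate([K_self, K_1]), np.concatenate([V_self, V_1]))
    a_m = attention(Q, np.concatenate([K_self, K_m]), np.concatenate([V_self, V_m]))
    return (1.0 - w) * a_1 + w * a_m

# Toy tensors standing in for one attention head (illustrative sizes only).
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K_self, V_self = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
K_1, V_1 = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
K_m, V_m = rng.normal(size=(6, 8)), rng.normal(size=(6, 8))

inner = inner_interpolated_attention(Q, K_1, V_1, K_m, V_m, 0.5)
outer = outer_interpolated_attention(Q, K_1, V_1, K_m, V_m, 0.5)
fused = fused_outer_attention(Q, K_self, V_self, K_1, V_1, K_m, V_m, 0.5)
print(inner.shape, outer.shape, fused.shape)
```

Both variants reduce to plain attention on the first endpoint at `w = 0` and on the second at `w = 1`; they differ in between because softmax is nonlinear, which is why the choice between them matters.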

Paper Structure (29 sections, 1 theorem, 19 equations, 17 figures, 2 tables)

This paper contains 29 sections, 1 theorem, 19 equations, 17 figures, 2 tables.

Key Result

Proposition 1

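Given query $Q$ from a latent variable $z$, keys and values $\{K_1, V_1\}$ and $\{K_m, V_m\}$ from text conditions $\{c_1, c_m\}$, and linearly interpolated text conditions $c_i$, the resulting cross-attention module $A(z, c_i)$ is given by linearly interpolated keys and values $\bar{K}_i$ and $\bar{V}_i$:

$$A(z, c_i) = \operatorname{softmax}\!\left(\frac{Q \bar{K}_i^\top}{\sqrt{d_k}}\right) \bar{V}_i, \qquad \bar{K}_i = (1 - w_i) K_1 + w_i K_m, \quad \bar{V}_i = (1 - w_i) V_1 + w_i V_m,$$

where $d_k$ is the key dimension and $w_i$ is the interpolation coefficient, defined as in the linear interpolation of the input text conditions.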
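Proposition 1 can be checked numerically with a short sketch. It assumes that keys and values are linear projections of the text condition and that cross-attention is standard scaled dot-product attention; the projection matrices `W_K` and `W_V`, the tensor sizes, and the convention $c_i = (1 - w_i) c_1 + w_i c_m$ are illustrative assumptions, not taken from the paper's code.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Standard scaled dot-product attention.
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

rng = np.random.default_rng(0)
n_tokens, d_text, d_k, n_query = 7, 16, 8, 5

# Hypothetical linear key/value projections of the text condition.
W_K = rng.normal(size=(d_text, d_k))
W_V = rng.normal(size=(d_text, d_k))

Q = rng.normal(size=(n_query, d_k))       # query from the latent z
c_1 = rng.normal(size=(n_tokens, d_text))  # first text condition
c_m = rng.normal(size=(n_tokens, d_text))  # last text condition

w_i = 0.3
c_i = (1.0 - w_i) * c_1 + w_i * c_m        # linearly interpolated condition

# Left-hand side: cross-attention on the interpolated text condition.
lhs = attention(Q, c_i @ W_K, c_i @ W_V)

# Right-hand side: cross-attention on linearly interpolated keys and values.
K_bar = (1.0 - w_i) * (c_1 @ W_K) + w_i * (c_m @ W_K)
V_bar = (1.0 - w_i) * (c_1 @ W_V) + w_i * (c_m @ W_V)
rhs = attention(Q, K_bar, V_bar)

print(np.allclose(lhs, rhs))
```

The two sides agree because the projections are linear, so interpolating the condition and then projecting gives the same keys and values as projecting each endpoint and then interpolating.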

Figures (17)

  • Figure 1: Our approach enables text-to-image diffusion models to generate nuanced spatial and conceptual interpolations, with seamless transitions in layout (a), smooth conceptual blending (b-e), and user-specified prompts to guide the interpolation paths (f).
  • Figure 2: Comparison of results between AID (the $1^\text{st}$ row) and text embedding interpolation (TEI, the $2^\text{nd}$ row). AID significantly increases smoothness, consistency, and fidelity.
  • Figure 3: An overview of PAID: Prompt-guided Attention Interpolation of Diffusion. The main components include: (1) replacing both cross-attention and self-attention with fused interpolated attention when generating the interpolated images; (2) selecting interpolation coefficients with a Beta prior; (3) injecting prompt guidance into the fused interpolated cross-attention.
  • Figure 4: Qualitative comparison of different ablation settings of AID. (a) Qualitative comparison between AID without fusion ($1^\text{st}$ row), AID with fusion ($2^\text{nd}$ row), and AID with fusion and Beta prior ($3^\text{rd}$ row). Fusing interpolation with self-attention significantly alleviates artifacts in the interpolated images, while the Beta prior further increases smoothness on top of AID with fusion. (b) CLIP score of different methods on compositional generation.
  • Figure 5: Results of image editing control. Our method improves control over the strength of editing. In both (a) and (b), the first row is generated by P2P + AID and the second row by P2P + TEI.
  • ...and 12 more figures

Theorems & Definitions (1)

  • Proposition 1