
Uniform Attention Maps: Boosting Image Fidelity in Reconstruction and Editing

Wenyi Mo, Tianyu Zhang, Yalong Bai, Bing Su, Ji-Rong Wen

TL;DR

The paper tackles the fidelity gap in tuning-free diffusion-based image editing by diagnosing non-uniform cross-attention as a key source of reconstruction errors during DDIM inversion. It introduces Uniform Cross-attention Maps to stabilize semantic guidance and an adaptive mask-guided editing scheme that blends auxiliary and target branches to preserve detail while enacting edits. Empirical results across reconstruction, composition, and editing tasks demonstrate improved fidelity and robustness, with ablations validating critical hyperparameters. The approach offers a practical, tuning-free pathway to higher-quality diffusion-based image processing, with strong implications for real-world editing and composition workflows. The method achieves high faithfulness to input images and coherent, targeted edits, underscoring the potential of uniform attention as a general technique in diffusion models.

Abstract

Text-guided image generation and editing using diffusion models have achieved remarkable advancements. Among these, tuning-free methods have gained attention for their ability to perform edits without extensive model adjustments, offering simplicity and efficiency. However, existing tuning-free approaches often struggle with balancing fidelity and editing precision. Reconstruction errors in DDIM Inversion are partly attributed to the cross-attention mechanism in U-Net, which introduces misalignments during the inversion and reconstruction process. To address this, we analyze reconstruction from a structural perspective and propose a novel approach that replaces traditional cross-attention with uniform attention maps, significantly enhancing image reconstruction fidelity. Our method effectively minimizes distortions caused by varying text conditions during noise prediction. To complement this improvement, we introduce an adaptive mask-guided editing technique that integrates seamlessly with our reconstruction approach, ensuring consistency and accuracy in editing tasks. Experimental results demonstrate that our approach not only excels in achieving high-fidelity image reconstruction but also performs robustly in real image composition and editing scenarios. This study underscores the potential of uniform attention maps to enhance the fidelity and versatility of diffusion-based image processing methods. Code is available at https://github.com/Mowenyii/Uniform-Attention-Maps.
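The replacement described in the abstract can be illustrated with a toy example. The following is a minimal NumPy sketch, not the authors' implementation: it assumes a single attention head, and the function names and toy tensor sizes are hypothetical. It follows the description in the Figure 1 (b) caption, where the attention map is replaced by one that spreads weight equally across the token dimension and is then combined with the value tokens $V$ to give the attention term $A$.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q, k, v):
    """Standard cross-attention: weights depend on how queries match keys."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores, axis=-1) @ v

def uniform_cross_attention(q, v):
    """Uniform variant (sketch): every query position weights all text tokens equally."""
    n_queries, n_tokens = q.shape[0], v.shape[0]
    uniform_map = np.full((n_queries, n_tokens), 1.0 / n_tokens)
    return uniform_map @ v

# Toy sizes (assumptions): 16 spatial positions, 5 text tokens, feature dim 8.
rng = np.random.default_rng(0)
q = rng.normal(size=(16, 8))
k = rng.normal(size=(5, 8))
v = rng.normal(size=(5, 8))

a_standard = cross_attention(q, k, v)      # varies with the query features
a_uniform = uniform_cross_attention(q, v)  # the token-wise mean of v at every position
```

In this sketch the uniform term no longer depends on the query features, so it takes the same value whenever the same $V$ is used, whereas the standard term changes as the latent changes.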

Paper Structure

This paper contains 16 sections, 9 equations, 14 figures, 6 tables, 1 algorithm.

Figures (14)

  • Figure 1: (a) Image reconstruction using DDIM with different prompts. The first image shows the input image, followed by the reconstruction using the source prompt "a photo of avocados," the null prompt (an empty string), and the result using Uniform Attention Maps combined with token values from the null prompt. (b) Our approach introduces Uniform Attention Maps, where traditional attention maps are replaced with uniform maps that distribute attention weights equally across the token dimension. By combining these uniform maps with the value tokens $V$, we generate a more balanced attention term $A$. This method ensures consistent attention, resulting in more accurate image reconstructions, as demonstrated in the final image of part (a).
  • Figure 2: The process of reconstruction using DDIM inversion under various conditions. It visually depicts (a) the heatmaps of the cross-attention term $A^{(l)}$, summed along the dimension $d^{(l)}_x$, from the U-Net model’s layers with output dimensions of $64 \times 64$, and (b) the predicted latent representation $\hat{z}_0$ at different stages of both the inversion and reconstruction processes. In (a), discrepancies in the cross-attention maps between the inversion and reconstruction phases are evident, with misalignment causing errors in image fidelity under the source and null prompt conditions. In (b), the reconstructed images show significant distortions under the source and null conditions, whereas our method consistently maintains high image quality throughout the reconstruction process.
  • Figure 3: Correlation between MSE of cross-attention term $A^{(l)}$ and clean image prediction $\hat{z}_{0}$ during inversion and reconstruction. The scatter plot shows that discrepancies in the cross-attention term $A^{(l)}_t$ from all U-Net model's layers with output dimensions of $64\times64$ during the inversion and reconstruction phases contribute significantly to the Mean Squared Error (MSE) in the predicted clean image $\hat{z}_{0,t}$, as evidenced by the positive correlation across 700 images from the PIE benchmark \cite{DBLP:journals/corr/abs-2310-01506}.
  • Figure 4: The proposed tuning-free image editing framework. We find that using Uniform Cross-attention Maps yields excellent reconstruction results, as shown in Tab. \ref{tab:novelty}. We introduce an auxiliary branch and generate masks based on the differences between the source branch and the target branch to blend the results of the auxiliary branch. Our method effectively enhances the performance of existing image editing algorithms. The process of using Uniform Attention Maps is shown in Fig. \ref{fig:motivation} (b).
  • Figure 5: Qualitative comparison with SOTA and baselines on the image composition task on the TF-ICON benchmark. Our method generates images with higher fidelity to the reference images and produces more realistic results.
  • ...and 9 more figures
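
The mask-guided blending mentioned in the Figure 4 caption can be sketched roughly as follows. This is an illustrative guess at the general idea, not the paper's algorithm: the fixed `threshold`, the channel averaging, the choice of which branch fills the masked region, and the function names `difference_mask` and `blend` are all assumptions, and the paper's adaptive mask is likely more involved.

```python
import numpy as np

def difference_mask(source, target, threshold=0.1):
    """Binary mask (H, W) marking where the source and target branches differ.

    Assumption: differences are averaged over channels and compared to a fixed threshold.
    """
    diff = np.abs(source - target).mean(axis=0)
    return (diff > threshold).astype(source.dtype)

def blend(auxiliary, target, mask):
    """Assumption: keep target-branch content inside the mask, auxiliary content elsewhere."""
    return mask * target + (1.0 - mask) * auxiliary

# Toy feature maps with shape (channels, height, width).
source = np.zeros((4, 8, 8))
target = source.copy()
target[:, 2:4, 2:4] += 1.0            # the "edited" region differs from the source
auxiliary = np.full((4, 8, 8), 0.5)   # stand-in for the auxiliary-branch result

mask = difference_mask(source, target)
blended = blend(auxiliary, target, mask)
```

Here the edited 2x2 region is taken from the target branch, while every other position keeps the auxiliary-branch values.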