I2AM: Interpreting Image-to-Image Latent Diffusion Models via Bi-Attribution Maps

Junseo Park, Hyeryung Jang

TL;DR

I2AM introduces bidirectional Image-to-Image Attribution Maps to interpret cross-attention in I2I diffusion models, aggregating attention across time steps, heads, and layers to produce reference-to-generated and generated-to-reference maps. Key constructs include ULAM, TLAM, HLAM, LLAM, and SRAM, along with IMACS for evaluating alignment with inpainting masks. Empirical results across object detection, inpainting, and super-resolution demonstrate I2AM’s ability to reveal meaningful attribution patterns and its usefulness for debugging and model refinement, with IMACS showing strong correspondence to downstream metrics. The approach promises improved reliability and controllability of I2I diffusion systems and sets the stage for broader applications such as colorization and style transfer.

Abstract

Large-scale diffusion models have made significant advances in image generation, particularly through cross-attention mechanisms. While cross-attention has been well studied in text-to-image tasks, its interpretability in image-to-image (I2I) diffusion models remains underexplored. This paper introduces Image-to-Image Attribution Maps (I2AM), a method that enhances the interpretability of I2I models by visualizing bidirectional attribution maps, from the reference image to the generated image and vice versa. I2AM aggregates cross-attention scores across time steps, attention heads, and layers, offering insights into how critical features are transferred between images. We demonstrate the effectiveness of I2AM across object detection, inpainting, and super-resolution tasks. Our results show that I2AM successfully identifies key regions responsible for generating the output, even in complex scenes. Additionally, we introduce the Inpainting Mask Attention Consistency Score (IMACS) as a novel evaluation metric to assess the alignment between attribution maps and inpainting masks, which correlates strongly with existing performance metrics. Through extensive experiments, we show that I2AM enables model debugging and refinement, providing practical tools for improving I2I models' performance and interpretability.
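
The aggregation described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the tensor layout `(T, L, H, N_gen, N_ref)`, the assumption that all layers have been resized to a common patch grid, the use of a plain mean for aggregation, and the function names `softmax` and `bidirectional_maps` are all assumptions made for illustration. The idea is to normalise raw cross-attention scores along one patch axis, average over time steps, layers, and heads, and then average over the other patch axis, which yields one map on the generated image and one on the reference image.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def bidirectional_maps(scores):
    """Aggregate raw cross-attention scores into two attribution maps.

    scores: array of shape (T, L, H, N_gen, N_ref) holding raw (pre-softmax)
            scores for T time steps, L layers, H heads, N_gen generated
            patches, and N_ref reference patches. This layout is a
            hypothetical convention chosen for the sketch.
    Returns (map_on_generated, map_on_reference), each summing to 1.
    """
    # Normalise over reference patches: each generated patch spreads
    # a unit of attention across the reference image.
    over_ref = softmax(scores, axis=-1)
    # Normalise over generated patches: each reference patch spreads
    # a unit of attention across the generated image.
    over_gen = softmax(scores, axis=-2)

    # Aggregate across time steps, layers, and heads (simple mean, assumed).
    over_ref = over_ref.mean(axis=(0, 1, 2))  # (N_gen, N_ref)
    over_gen = over_gen.mean(axis=(0, 1, 2))  # (N_gen, N_ref)

    # Collapse the remaining patch axis to get one score per patch.
    map_on_reference = over_ref.mean(axis=0)  # (N_ref,)
    map_on_generated = over_gen.mean(axis=1)  # (N_gen,)
    return map_on_generated, map_on_reference

rng = np.random.default_rng(0)
scores = rng.normal(size=(3, 2, 4, 16, 9))  # toy sizes, not from the paper
gen_map, ref_map = bidirectional_maps(scores)
```

Averaging only over a subset of the leading axes (for example, keeping the layer axis) would give per-layer, per-head, or per-time-step variants of the same maps.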

Paper Structure

This paper contains 23 sections, 16 equations, 21 figures, and 4 tables.

Figures (21)

  • Figure 1: Cross-attention maps using $\text{I}^2\text{AM}$. The top map shows how the generated image is influenced by the reference image (Q1), while the bottom map illustrates how the reference image contributes to the generated image (Q2). The right map highlights specific reference-to-output patch contributions.
  • Figure 2: Visualization of layer-level attribution maps (LLAM) for each task. (a) LLAMs for the StableVITON and DCI-VTON models at layers $2$, $5$, and $8$ demonstrate how clothing features are progressively incorporated during the inpainting process. (b) LLAMs for the PASD model show the contribution of reference data in refining image resolution at different layers.
  • Figure 3: (a) U-Net layer at time $t$ and layer $l$, where the image embeddings are supplied to the cross-attention. (b) U-Net encoder providing multi-scale image embeddings ${\mathbf c}_{{\mathbf I}}^{(l)}$; and (c) image encoder supplying fixed-size embeddings ${\mathbf c}_{{\mathbf I}}$ to the cross-attention module.
  • Figure 4: Overview of SRAM, showing how attention scores from all patch embeddings of the reference image (clothing) are calculated to analyze correlation with a specific generated patch $i$. The red point on the clothing indicates the reference patch with the highest influence on the generated image.
  • Figure 5: Object detection pipeline using PBE and $\text{I}^2\text{AM}$. The ULAM for the R2G direction highlights key regions for object detection in the generated image.
  • ...and 16 more figures
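
The patch-level lookup described in the Figure 4 caption (SRAM) can be sketched as follows. This is only an illustration under stated assumptions: it takes an already aggregated attention matrix of shape `(N_gen, N_ref)`, and the function name `specific_reference_map` and the argument `ref_hw` (the reference patch grid size) are hypothetical. For a chosen generated patch $i$, it returns that patch's attention over all reference patches as a 2D map, plus the grid position of the most influential reference patch (the red point in the figure).

```python
import numpy as np

def specific_reference_map(attn, i, ref_hw):
    """Attention of generated patch i over the reference patch grid.

    attn:   aggregated attention, shape (N_gen, N_ref) (assumed layout).
    i:      index of the generated patch to inspect.
    ref_hw: (height, width) of the reference patch grid, with
            height * width == N_ref.
    Returns the 2D map and the (row, col) of its maximum.
    """
    h, w = ref_hw
    ref_map = attn[i].reshape(h, w)
    # Locate the reference patch with the highest score for patch i.
    top = np.unravel_index(np.argmax(ref_map), ref_map.shape)
    return ref_map, (int(top[0]), int(top[1]))

# Toy example: 4 generated patches, a 2x3 reference grid.
attn = np.zeros((4, 6))
attn[2, 4] = 1.0
ref_map, top_patch = specific_reference_map(attn, i=2, ref_hw=(2, 3))
```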