
Enhancing Conditional Image Generation with Explainable Latent Space Manipulation

Kshitij Pathania

TL;DR

This work addresses the problem of faithfully reconstructing the context of a reference image while following conditional prompts in diffusion-based image synthesis. It introduces Abs-Grad-SAM, which adapts Grad-SAM to cross-attention gradients in latent diffusion to compute subject-specific importance scores and generate masks that guide selective latent-space background replacement during denoising. By applying Gaussian smoothing and dilation to these masks and performing targeted latent manipulation at selected timesteps, the method achieves improved fidelity (FID) and competitive textual alignment (CLIP) on Places365, with $\mathrm{FID}_{mean}=6.89$, $\mathrm{FID}_{med}=5.32$ and $\mathrm{CLIP}_{mean}=27.98$, $\mathrm{CLIP}_{med}=28.24$. The approach offers an explainable mechanism for controlling subject formation and background integration in text-to-image synthesis.
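The mask construction described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes the importance score is the element-wise product of a cross-attention map and the absolute value of its gradient, that the mask is obtained by thresholding the rescaled score, and that smoothing is applied before dilation. The function names, the threshold, `sigma`, and `radius` are all hypothetical choices made for illustration.

```python
import numpy as np

def importance_map(attn, grad):
    """Assumed scoring rule: |gradient| * attention, rescaled to [0, 1]."""
    score = np.abs(grad) * attn
    span = score.max() - score.min()
    return (score - score.min()) / span if span > 0 else np.zeros_like(score)

def gaussian_smooth(x, sigma=1.0):
    """Separable Gaussian blur with edge padding (output has the shape of x)."""
    radius = max(1, int(3 * sigma))
    t = np.arange(-radius, radius + 1)
    kernel = np.exp(-(t ** 2) / (2 * sigma ** 2))
    kernel /= kernel.sum()
    padded = np.pad(x, radius, mode="edge")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, kernel, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, kernel, mode="valid"), 0, rows)

def dilate(x, radius=1):
    """Grey-scale dilation: maximum over a (2r+1) x (2r+1) square window."""
    padded = np.pad(x, radius, mode="edge")
    h, w = x.shape
    out = x.copy()
    for dy in range(2 * radius + 1):
        for dx in range(2 * radius + 1):
            out = np.maximum(out, padded[dy:dy + h, dx:dx + w])
    return out

def subject_mask(attn, grad, threshold=0.5, sigma=1.0, radius=1):
    """Threshold the importance map, then smooth and dilate the mask."""
    hard = (importance_map(attn, grad) >= threshold).astype(float)
    return dilate(gaussian_smooth(hard, sigma), radius)

# Toy example: a 2x2 "subject" region with negative gradients.
attn = np.zeros((8, 8))
attn[3:5, 3:5] = 1.0
grad = -np.ones((8, 8))
mask = subject_mask(attn, grad)
```

In this toy case the mask is soft: it is largest around the subject block and decays toward the image borders, so a later blend between generated and reference latents changes gradually rather than along a hard edge.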

Abstract

In image synthesis, achieving fidelity to a reference image while adhering to conditional prompts remains a significant challenge. This paper proposes an approach that integrates a diffusion model with latent space manipulation and gradient-based selective attention mechanisms to address this issue. Leveraging Grad-SAM (Gradient-based Selective Attention Manipulation), we analyze the cross-attention maps of the cross-attention layers and their gradients with respect to the denoised latent vector, deriving importance scores for the elements of the denoised latent vector that relate to the subject of interest. Using this information, we create masks at specific timesteps during denoising to preserve subjects while seamlessly integrating the reference image features. This approach ensures the faithful formation of subjects based on conditional prompts, while concurrently refining the background for a more coherent composition. Our experiments on the Places365 dataset demonstrate promising results, with our proposed model achieving the lowest mean and median Fréchet Inception Distance (FID) scores among the compared baseline models, indicating superior fidelity preservation. Furthermore, our model exhibits competitive performance in aligning the generated images with the provided textual descriptions, as evidenced by high CLIP scores. These results highlight the effectiveness of our approach in both fidelity preservation and textual context preservation in text-to-image synthesis tasks.
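As a rough illustration of the masked replacement step, the sketch below blends a generated latent with a reference latent at selected timesteps. It is written under stated assumptions and is not the paper's algorithm: `denoise_step` and `add_noise` are hypothetical callables standing in for the diffusion model's reverse step and forward noising, and bringing the reference latent to the current noise level before blending is an assumption. The mask is taken to be close to 1 on the subject and close to 0 on the background.

```python
import numpy as np

def replace_background(z_gen, z_ref_t, mask):
    """Keep the generated latent where mask is near 1, the reference elsewhere."""
    return mask * z_gen + (1.0 - mask) * z_ref_t

def denoise_with_replacement(z_T, z_ref, mask, denoise_step, add_noise,
                             timesteps, replace_at):
    """Run the reverse process and blend in the reference at chosen timesteps.

    denoise_step(z, t) and add_noise(z_ref, t) are hypothetical stand-ins
    for the model's reverse step and the forward noising of the reference.
    """
    z = z_T
    for t in timesteps:
        z = denoise_step(z, t)
        if t in replace_at:
            z = replace_background(z, add_noise(z_ref, t), mask)
    return z

# Toy stand-ins so the sketch runs without a diffusion model.
toy_denoise = lambda z, t: 0.5 * z   # halves the latent at each step
toy_noise = lambda z, t: z           # identity: no noise added

mask = np.zeros((8, 8))
mask[3:5, 3:5] = 1.0                 # subject region
z_T = np.ones((4, 8, 8))             # 4-channel latent; mask broadcasts
z_ref = 5.0 * np.ones((4, 8, 8))

out = denoise_with_replacement(z_T, z_ref, mask, toy_denoise, toy_noise,
                               timesteps=[3, 2, 1, 0], replace_at={2})
```

With these toy callables, the subject region keeps following the generated trajectory, while the background switches to the reference latent at timestep 2 and is then denoised along with everything else.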

Paper Structure (16 sections, 9 equations, 5 figures, 1 table, 1 algorithm)


Figures (5)

  • Figure 1: When a reference image is provided alongside a prompt to a diffusion model, Stable Diffusion (SD) implementations often struggle to preserve contextual details from the reference. This is primarily due to the large amount of noise introduced during the forward process of DDPM, which leads to a loss of contextual fidelity. Conversely, reducing the noise can compromise prompt adherence by limiting the number of reconstruction steps, as is evident from the image produced by SD (Reduced Noise). Our proposed model addresses this challenge by maintaining a high number of noise steps while preserving context through targeted replacement of less-attended elements in the latent vector. As depicted, our model balances contextual preservation and prompt adherence.
  • Figure 2: Flowchart illustrating the Abs-Grad-SAM-based latent space manipulation technique for enhancing image generation performance.
  • Figure 3: Comparison of FID scores for images generated by different models across different scenes and categories.
  • Figure 4: Comparison of CLIP scores for images generated by different models across different scenes and categories.
  • Figure 5: A comparative analysis.