
Explaining generative diffusion models via visual analysis for interpretable decision-making process

Ji-Hoon Park, Yeong-Joon Ju, Seong-Whan Lee

TL;DR

This work devises tools for visualizing the diffusion process and poses three research questions that interpret it in terms of the visual concepts the model generates and the regions the model attends to at each time step, with the aim of making the diffusion process human-understandable.

Abstract

Diffusion models have demonstrated remarkable performance in generation tasks. Nevertheless, explaining the diffusion process remains challenging because it consists of a sequence of denoising steps over noisy images that are difficult even for experts to interpret. To address this issue, we propose three research questions that interpret the diffusion process in terms of the visual concepts generated by the model and the regions the model attends to at each time step. We devise tools for visualizing the diffusion process and answering these research questions, making the diffusion process human-understandable. Through experiments with various visual analyses using these tools, we show how the output is progressively generated by explaining the level of denoising and highlighting relationships to foundational visual concepts at each time step. During training, the diffusion model learns diverse visual concepts corresponding to each time step, enabling it to predict varying levels of visual concepts at different stages. We substantiate our tools using the Area Under the Curve (AUC) score, correlation quantification, and cross-attention mapping. Our findings provide insights into the diffusion process and pave the way for further research into explainable diffusion mechanisms.
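The AUC score mentioned in the abstract is typically computed over a deletion (or insertion) curve, as in the deletion game described in the Figure 4 caption. The following is a minimal sketch of that idea, not the authors' implementation: `deletion_curve`, `curve_auc`, and the toy `score_fn` are hypothetical names, and the real score would presumably be a similarity between the model's output and a reference image rather than the simple pixel mean used here.

```python
import numpy as np

def deletion_curve(image, saliency, score_fn, n_steps=8):
    """Zero out pixels from highest to lowest saliency, recording the score."""
    h, w = saliency.shape
    order = np.argsort(saliency.ravel())[::-1]      # most salient pixels first
    per_step = int(np.ceil(order.size / n_steps))
    current = image.astype(float).copy()
    scores = [score_fn(current)]                    # score before any deletion
    for step in range(n_steps):
        idx = order[step * per_step:(step + 1) * per_step]
        rows, cols = np.unravel_index(idx, (h, w))
        current[rows, cols] = 0.0                   # "delete" this batch of pixels
        scores.append(score_fn(current))
    return np.array(scores)

def curve_auc(scores):
    """Trapezoidal area under the curve with the x-axis normalised to [0, 1]."""
    x = np.linspace(0.0, 1.0, len(scores))
    return float(np.sum((scores[1:] + scores[:-1]) / 2.0 * np.diff(x)))

# Toy setup (assumption): only the top-left 4x4 block matters to the score.
image = np.ones((8, 8))
score_fn = lambda img: img[:4, :4].mean()

good_saliency = np.zeros((8, 8))
good_saliency[:4, :4] = 1.0                         # highlights the relevant block
bad_saliency = 1.0 - good_saliency                  # highlights everything else

auc_good = curve_auc(deletion_curve(image, good_saliency, score_fn))
auc_bad = curve_auc(deletion_curve(image, bad_saliency, score_fn))
print(auc_good, auc_bad)
```

In a deletion game, a lower AUC indicates a better saliency map, because removing the truly important pixels first makes the score collapse quickly. The insertion game reverses the procedure, and there a higher AUC is better.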

Paper Structure

This paper contains 12 sections, 14 equations, 11 figures, 3 tables, 1 algorithm.

Figures (11)

  • Figure 1: Visualization of the image using a saliency map and exponential sampling. The images are generated from the prompt "a doctor singing a song". (a) Image generated using the original scheduler's sampling, which uses uniform intervals. (b) Image focusing on the early stage (e.g., from $1000$ to $800$). (c) Image focusing on the later stage (e.g., from $200$ to $0$). The heat map shows the model's decision-making process at a particular step.
  • Figure 2: Overview of our proposed DF-RISE framework. The framework includes a masking method and a structural similarity function that is applicable to the diffusion generative model. The saliency map is expressed using heatmaps.
  • Figure 3: Qualitative evaluation of DF-RISE against a baseline. We compare the DF-RISE visualization to LIME for qualitative evaluation. While LIME depends on the segmentation region, DF-RISE visualizes the decision process unaffected by external factors.
  • Figure 4: Comparison of the deletion game and insertion game with baselines. We evaluate the deletion and insertion games for DF-CAM and DF-RISE against baseline and random perturbations. We delete or insert data from high to low activation. The deletion graph is in the first column, and the insertion graph is in the second column. The initial derivative of the curve indicates whether the key information is identified.
  • Figure 5: Ablation study. We generate an image of 'an astronaut riding a horse on Mars'. We compare the similarity function with the FID score, cosine similarity, luminance, and contrast. The first column is the heat map output, the second column is the result of the deletion game, and the third column is the result of the insertion game.
  • ...and 6 more figures
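
The Figure 1 caption contrasts uniform-interval scheduler sampling with exponential sampling that concentrates on either the early stage (around $1000$ to $800$) or the later stage (around $200$ to $0$). The exact formula is not given in this excerpt, so the sketch below is only one plausible way to space time steps exponentially: the function names, the `rate` parameter, and the `focus` flag are all assumptions made for illustration.

```python
import numpy as np

def uniform_timesteps(t_max=1000, n=11):
    """Equally spaced time steps from t_max down to 0."""
    return np.linspace(t_max, 0, n).round().astype(int)

def exponential_timesteps(t_max=1000, n=11, rate=5.0, focus="early"):
    """Exponentially spaced time steps from t_max down to 0 (hypothetical).

    focus="early" packs steps near t_max (start of denoising);
    focus="late" packs steps near 0 (end of denoising).
    """
    u = np.linspace(0.0, 1.0, n)
    # frac rises from 0 to 1, slowly at first and quickly at the end.
    frac = (np.exp(rate * u) - 1.0) / (np.exp(rate) - 1.0)
    if focus == "early":
        t = t_max * (1.0 - frac)      # small changes first: dense near t_max
    else:
        t = t_max * frac[::-1]        # large drops first: dense near 0
    return t.round().astype(int)

uniform = uniform_timesteps()
early = exponential_timesteps(focus="early")
late = exponential_timesteps(focus="late")
print("uniform:", uniform)
print("early  :", early)
print("late   :", late)
```

With these defaults, most of the early-focused steps fall in the $1000$ to $800$ range and most of the late-focused steps fall in the $200$ to $0$ range, whereas uniform spacing places only a few steps in either range.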