DiffusionPID: Interpreting Diffusion via Partial Information Decomposition

Rushikesh Zawar, Shaurya Dewan, Prakanshul Saxena, Yingshan Chang, Andrew Luo, Yonatan Bisk

TL;DR

This work presents Diffusion Partial Information Decomposition (DiffusionPID), a novel technique that applies information-theoretic principles to decompose the input text prompt into its elementary components, enabling a detailed examination of how individual tokens and their interactions shape the generated image.

Abstract

Text-to-image diffusion models have made significant progress in generating naturalistic images from textual inputs, and demonstrate the capacity to learn and represent complex visual-semantic relationships. While these diffusion models have achieved remarkable success, the underlying mechanisms driving their performance are not yet fully accounted for, with many unanswered questions surrounding what they learn, how they represent visual-semantic relationships, and why they sometimes fail to generalize. Our work presents Diffusion Partial Information Decomposition (DiffusionPID), a novel technique that applies information-theoretic principles to decompose the input text prompt into its elementary components, enabling a detailed examination of how individual tokens and their interactions shape the generated image. We introduce a formal approach to analyze the uniqueness, redundancy, and synergy terms by applying PID to the denoising model at both the image and pixel level. This approach enables us to characterize how individual tokens and their interactions affect the model output. We first present a fine-grained analysis of the characteristics the model uses to uniquely localize specific concepts. We then apply our approach to bias analysis and show that it can recover gender and ethnicity biases. Finally, we use our method to visually characterize word ambiguity and similarity from the model's perspective and illustrate the efficacy of our method for prompt intervention. Our results show that PID is a potent tool for evaluating and diagnosing text-to-image diffusion models.
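To make the uniqueness, redundancy, and synergy terms concrete, here is a minimal sketch of a two-source partial information decomposition on small discrete distributions. It is not the paper's estimator: the abstract applies PID to the denoising model, whereas this toy example uses exact probability tables and the simple minimum-mutual-information (MMI) redundancy as a stand-in. The function names (`mutual_information`, `pid_mmi`) and the toy distributions are assumptions chosen for illustration.

```python
import math
from collections import defaultdict


def mutual_information(joint, src_idx):
    """Return I(S; Y) in bits.

    `joint` maps (x1, x2, y) -> probability. `src_idx` selects which of the
    two source coordinates make up S, e.g. (0,), (1,), or (0, 1).
    """
    p_s = defaultdict(float)
    p_y = defaultdict(float)
    p_sy = defaultdict(float)
    for (x1, x2, y), p in joint.items():
        s = tuple((x1, x2)[i] for i in src_idx)
        p_s[s] += p
        p_y[y] += p
        p_sy[(s, y)] += p
    return sum(
        p * math.log2(p / (p_s[s] * p_y[y]))
        for (s, y), p in p_sy.items()
        if p > 0
    )


def pid_mmi(joint):
    """Two-source PID using the MMI redundancy (an illustrative assumption).

    The terms satisfy the standard two-source PID identities:
      I(X1; Y)     = redundancy + unique_1
      I(X2; Y)     = redundancy + unique_2
      I(X1, X2; Y) = redundancy + unique_1 + unique_2 + synergy
    """
    i1 = mutual_information(joint, (0,))
    i2 = mutual_information(joint, (1,))
    i12 = mutual_information(joint, (0, 1))
    redundancy = min(i1, i2)  # MMI stand-in, not the paper's measure
    unique_1 = i1 - redundancy
    unique_2 = i2 - redundancy
    synergy = i12 - redundancy - unique_1 - unique_2
    return {
        "redundancy": redundancy,
        "unique_1": unique_1,
        "unique_2": unique_2,
        "synergy": synergy,
    }


# XOR target: neither source alone says anything about y, but together
# they determine it, so all of the information is synergistic.
xor_joint = {(a, b, a ^ b): 0.25 for a in (0, 1) for b in (0, 1)}

# Copy target: both sources carry the same bit as y, so all of the
# information is redundant.
copy_joint = {(0, 0, 0): 0.5, (1, 1, 1): 0.5}

xor_pid = pid_mmi(xor_joint)
copy_pid = pid_mmi(copy_joint)
print("XOR :", xor_pid)
print("COPY:", copy_pid)
```

In this toy setting, the XOR case loosely mirrors the role the paper assigns to synergy (context such as "baseball" disambiguating "bat"), and the copy case mirrors redundancy (two tokens carrying the same information, as with synonyms).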

Paper Structure

This paper contains 30 sections, 17 equations, 42 figures, and 2 tables.

Figures (42)

  • Figure 1: Concept Figure. Left: Our "baseball" uniqueness map specifically highlights the seam region of the tennis ball, as it is visually very similar to that of a baseball. Center: We see a high synergy for "bat" with "baseball" and "overhead" respectively, which shows that the model uses these contextual cues to generate the images in the right settings. Right: Our redundancy map between "queen" and "crown" correctly focuses on the crown and facial region.
  • Figure 2: Homonyms. We see that the word "baseball" provides the required synergistic context with the homonym "bat" to pick the sports setting over the animal. This effect is confirmed by the synergy maps and the image-level synergy values (S), where we observe a higher synergy for "bat" with "baseball" than with other words such as "He" and "swung".
  • Figure 3: Homonyms. Left: Successful generation of the homonym "bowl" in different contexts due to high synergy with modifiers "bowl" and "game". Right: Failure case where the model generates the homonym "mole" with the same semantic meaning, the animal, because it fails to use contextual information from words like "coworker", as can be seen in the synergy map.
  • Figure 4: Synonyms. Our redundancy map is able to highlight that the model considers the pairs "bed" and "mattress" (left) and "cube" and "cuboid" (right) as semantically similar.
  • Figure 5: COCO Co-Hyponyms. The redundancy map helps identify the reason behind the model's failures in these figures. The model treats the co-hyponym pairs ("sandwich", "pizza") (left) and ("elephant", "cat") (right) as having the same meaning, as seen in the redundancy maps, which results in erroneous generations.
  • ...and 37 more figures