Unveiling Concept Attribution in Diffusion Models

Quang H. Nguyen, Hoang Phan, Khoa D. Doan

TL;DR

The paper tackles the interpretability challenge of diffusion models by revealing that concept generation is controlled by a small set of fine-grained parameters, including both positive components that store knowledge and negative components that suppress it. It introduces CAD, a first-order, linear-counterfactual attribution framework that estimates per-parameter contributions to a target concept with a single forward-backward pass, enabling lightweight, inference-time edits. The authors implement two editing algorithms, CAD-Erase and CAD-Amplify, to erase or amplify concepts by ablating identified positive or negative components, respectively, and validate the approach across multiple diffusion variants and concepts (objects, styles, nudity). They demonstrate that knowledge is highly localized, that negative components exist and can be leveraged to amplify target concepts, and that editing with CAD can remove or recall concepts while preserving unrelated knowledge. The work advances trustworthy diffusion modeling by providing a complete, parameter-level view of concept attribution and practical editing tools, with implications for safety and controllability in real-world deployments.

Abstract

Diffusion models have shown remarkable abilities in generating realistic and high-quality images from text prompts. However, a trained model remains largely a black box; we know little about the roles its components play in exhibiting a concept such as an object or a style. Recent works employ causal tracing to localize knowledge-storing layers in generative models without showing how other layers contribute to the target concept. In this work, we approach the interpretability problem of diffusion models from a more general perspective and pose a question: "How do model components work jointly to demonstrate knowledge?" To answer this question, we decompose diffusion models using component attribution, systematically unveiling the importance of each component (specifically, each model parameter) in generating a concept. The proposed framework, called Component Attribution for Diffusion Model (CAD), discovers the localization of concept-inducing (positive) components and, interestingly, uncovers another type of component that contributes negatively to generating a concept, which previous knowledge localization work has missed. Based on this holistic understanding of diffusion models, we introduce two fast, inference-time model editing algorithms, CAD-Erase and CAD-Amplify; in particular, CAD-Erase enables erasure and CAD-Amplify allows amplification of a generated concept by ablating the positive and negative components, respectively, while retaining knowledge of other concepts. Extensive experimental results validate the significance of both positive and negative components pinpointed by our framework, demonstrating the potential of providing a complete view of interpreting generative models. Our code is available at https://github.com/mail-research/CAD-attribution4diffusion.
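
The TL;DR describes CAD as a first-order, linear-counterfactual attribution that scores every parameter from a single forward-backward pass, then edits the model by ablating the highest-scoring positive or negative components. The toy sketch below illustrates that general idea only; it is not the authors' implementation. The sigmoid objective, the hand-derived gradient, the helper names (`first_order_attribution`, `ablate`, `top_k`), the choice of k = 2, and the sign convention (score = estimated drop in the objective when a parameter is set to zero) are all assumptions made for illustration.

```python
import math

def objective(w, x):
    """Hypothetical stand-in for a concept objective: sigmoid of a weighted sum."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def first_order_attribution(w, x):
    """Estimate J(w) - J(w with w_i = 0) for every i with a first-order Taylor step.

    One forward evaluation gives J and one backward step gives dJ/dw
    (derived by hand here; a real model would use automatic differentiation).
    The estimated effect of zeroing w_i is then w_i * dJ/dw_i.
    """
    j = objective(w, x)                      # forward pass
    grad = [j * (1.0 - j) * xi for xi in x]  # backward pass for the sigmoid
    return [wi * gi for wi, gi in zip(w, grad)]

def top_k(scores, k, largest=True):
    """Indices of the k largest (or k smallest) scores."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=largest)
    return order[:k]

def ablate(w, indices):
    """Return a copy of w with the selected parameters set to zero."""
    selected = set(indices)
    return [0.0 if i in selected else wi for i, wi in enumerate(w)]

# Illustrative parameters and inputs (made-up numbers).
w = [0.9, -1.2, 0.3, 1.5, -0.4, 0.1]
x = [1.0, 0.8, -0.5, 0.7, 1.1, 0.2]

scores = first_order_attribution(w, x)
positive = top_k(scores, k=2, largest=True)   # components that support the concept
negative = top_k(scores, k=2, largest=False)  # components that suppress it

base = objective(w, x)
erased = objective(ablate(w, positive), x)     # erase-style edit: drop positives
amplified = objective(ablate(w, negative), x)  # amplify-style edit: drop negatives

print(f"base={base:.3f} erased={erased:.3f} amplified={amplified:.3f}")
```

In this toy setting, zeroing the positive components lowers the objective and zeroing the negative components raises it, which mirrors the erase and amplify behavior described above. For a real diffusion model the objective, the gradient computation, and the number of ablated parameters would come from the paper and its released code rather than from this sketch.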

Paper Structure

This paper contains 25 sections, 5 equations, 16 figures, 12 tables, 2 algorithms.

Figures (16)

  • Figure 1: Overview of our framework. We show that there exist positive and negative components in diffusion that increase or decrease the probability of the target concept, respectively. Removing those components will have the reverse effect.
  • Figure 2: The attribution scores predicted by CAD and the actual values of the objective.
  • Figure 3: The qualitative results of CAD. Removing positive components to "English Springer" avoids generating that concept. Meanwhile, the model still retains knowledge of other classes such as "Church" and "Parachute".
  • Figure 4: The first two rows contain images generated by the original model and erasing methods on I2P prompts. We add * for publication. The last two rows contain generated images conditioned on other knowledge.
  • Figure 5: Qualitative results of CAD on erasing artist styles. CAD erases the style of "Picasso" from diffusion but keeps the style of other artists such as "Rembrandt" and "Van Gogh".
  • ...and 11 more figures