Unveiling Concept Attribution in Diffusion Models
Quang H. Nguyen, Hoang Phan, Khoa D. Doan
TL;DR
The paper tackles the interpretability challenge of diffusion models by revealing that concept generation is controlled by a small set of fine-grained parameters, including both positive components that store knowledge and negative components that suppress it. It introduces CAD, a first-order, linear-counterfactual attribution framework that estimates per-parameter contributions to a target concept with a single forward-backward pass, enabling lightweight, inference-time edits. The authors implement two editing algorithms, CAD-Erase and CAD-Amplify, to erase or amplify concepts by ablating identified positive or negative components, respectively, and validate the approach across multiple diffusion variants and concepts (objects, styles, nudity). They demonstrate that knowledge is highly localized, that negative components exist and can be leveraged to amplify target concepts, and that editing with CAD can remove or recall concepts while preserving unrelated knowledge. The work advances trustworthy diffusion modeling by providing a complete, parameter-level view of concept attribution and practical editing tools, with implications for safety and controllability in real-world deployments.
Abstract
Diffusion models have shown remarkable abilities in generating realistic, high-quality images from text prompts. However, a trained model remains largely a black box; little is known about the roles its components play in generating a concept such as an object or a style. Recent works employ causal tracing to localize knowledge-storing layers in generative models, but they do not show how other layers contribute to the target concept. In this work, we approach the interpretability problem of diffusion models from a more general perspective and pose the question: \textit{``How do model components work jointly to demonstrate knowledge?''}. To answer this question, we decompose diffusion models using component attribution, systematically unveiling the importance of each component (specifically, each model parameter) in generating a concept. The proposed framework, called \textbf{C}omponent \textbf{A}ttribution for \textbf{D}iffusion Model (CAD), localizes the concept-inducing (positive) components and also uncovers a second type of component that contributes negatively to generating a concept, which previous knowledge-localization work has overlooked. Based on this holistic understanding of diffusion models, we introduce two fast, inference-time model editing algorithms, CAD-Erase and CAD-Amplify. CAD-Erase erases a generated concept by ablating its positive components, and CAD-Amplify amplifies it by ablating its negative components; both retain knowledge of other concepts. Extensive experimental results validate the significance of both the positive and the negative components pinpointed by our framework, demonstrating the potential of a complete view for interpreting generative models. Our code is available \href{https://github.com/mail-research/CAD-attribution4diffusion}{here}.
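The idea of attributing a concept to individual parameters and then ablating positive or negative ones can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the toy objective `concept_score`, the attribution rule (a first-order Taylor estimate of the score change when one parameter is set to zero, i.e. `w_i * dscore/dw_i`), and the helper names `first_order_attribution` and `ablate` are all assumptions made for illustration. A real diffusion model would obtain the gradient from one backward pass through the network rather than from a closed-form derivative.

```python
import math


def concept_score(weights, features):
    """Toy stand-in (assumption) for a concept objective: tanh of a weighted sum."""
    return math.tanh(sum(w * x for w, x in zip(weights, features)))


def first_order_attribution(weights, features):
    """Estimate score(w) - score(w with w_i = 0) as w_i * dscore/dw_i.

    Positive values mark parameters that support the concept;
    negative values mark parameters that suppress it.
    """
    z = sum(w * x for w, x in zip(weights, features))
    dscore_dz = 1.0 - math.tanh(z) ** 2  # derivative of tanh at z
    return [w * dscore_dz * x for w, x in zip(weights, features)]


def ablate(weights, attributions, k, sign):
    """Zero up to k parameters with the strongest attribution of the given sign.

    sign=+1 targets positive components (erase-style edit);
    sign=-1 targets negative components (amplify-style edit).
    """
    ranked = sorted(
        range(len(weights)), key=lambda i: sign * attributions[i], reverse=True
    )
    chosen = {i for i in ranked[:k] if sign * attributions[i] > 0}
    return [0.0 if i in chosen else w for i, w in enumerate(weights)]


# Hypothetical toy parameters and inputs.
weights = [0.8, -0.5, 0.1, 0.3]
features = [1.0, 1.0, 0.5, -1.0]

attr = first_order_attribution(weights, features)
erased = ablate(weights, attr, k=1, sign=+1)
amplified = ablate(weights, attr, k=1, sign=-1)

base = concept_score(weights, features)
print("base:", base)
print("after erase-style edit:", concept_score(erased, features))
print("after amplify-style edit:", concept_score(amplified, features))
```

In this toy setting, zeroing the parameter with the largest positive attribution lowers the score, while zeroing the one with the most negative attribution raises it, which mirrors the erase and amplify behaviors described above. The first-order estimate is only an approximation; how well it tracks the true ablation effect in a real model is an empirical matter.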
