SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

Bartosz Cywiński, Kamil Deja

TL;DR

SAeUron is an interpretable, activation-based unlearning framework for text-to-image diffusion models. It trains sparse autoencoders (SAEs) on cross-attention activations collected across denoising steps, identifies a compact set of concept-specific features, and ablates them during inference to remove targeted content while preserving overall generation quality. It achieves state-of-the-art style unlearning and competitive object unlearning on UnlearnCanvas, and removes nudity robustly on I2P. Because each ablated feature is linked to a human-interpretable concept, the intervention is transparent; the method also handles multiple concepts at once and is resilient to adversarial prompts. Its noted limitations are inference overhead, the storage needed for activations, and difficulty with abstract or highly similar concepts.

Abstract

Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Our evaluation shows that SAeUron outperforms existing approaches on the UnlearnCanvas benchmark for concepts and style unlearning, and effectively eliminates nudity when evaluated with I2P. Moreover, we show that with a single SAE, we can remove multiple concepts simultaneously and that in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content under adversarial attack. Code and checkpoints are available at https://github.com/cywinski/SAeUron.

Paper Structure

This paper contains 45 sections, 7 equations, 35 figures, 8 tables, 2 algorithms.

Figures (35)

  • Figure 1: Concept unlearning in SAeUron. We localize and remove SAE features corresponding to the unwanted concept (Cartoon) while preserving the overall performance of the diffusion model.
  • Figure 2: Unlearning procedure in SAeUron. (a) Concept-specific features are selected for unlearning according to their importance scores. (b) During inference in the U-Net of the diffusion model, activation between selected cross-attention blocks is passed through a trained SAE. The selected SAE features are then ablated by scaling them with a negative multiplier $\gamma_c$, removing their influence on the final output. The remaining features are left unchanged, ensuring minimal impact on the overall model performance.
  • Figure 3: Feature importance scores. Most features have near-zero scores, indicating that the SAE learns only a few concept-specific features. During evaluation, we find the most important features according to this score and block them.
  • Figure 4: Object classification with the k-nearest neighbors algorithm based on SAE feature activations. Features selected with our score-based selection approach demonstrate strong discriminative power across timesteps. Even randomly selected features exhibit notably higher accuracy than a random-guess baseline, proving that the SAE learns meaningful visual attributes.
  • Figure 5: Activations of features selected for unlearning displayed on image patches. (Left) Features corresponding to the Bricks style strongly activate on patterns characteristic of this style. (Right) Conversely, Butterfly-related features activate successfully on image regions containing the object, regardless of the style.
  • ...and 30 more figures
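
The intervention described in the Figure 2 caption (encode an activation with the SAE, scale the selected features by a negative multiplier $\gamma_c$, decode, and leave the remaining features unchanged) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the ReLU encoder, the function name `ablate_features`, the weight names, and the toy identity weights are all assumptions made for illustration, and the paper's actual SAE architecture and feature selection procedure may differ.

```python
import numpy as np


def ablate_features(activation, W_enc, b_enc, W_dec, b_dec, selected, gamma):
    """Encode with a toy SAE, scale selected features by gamma, then decode.

    All names here are hypothetical. The ReLU encoder is an assumed
    stand-in for whatever sparse activation the real SAE uses.
    """
    # Encode: project the activation into the (non-negative) feature space.
    features = np.maximum(0.0, activation @ W_enc + b_enc)

    # Ablate: scale only the selected features; the rest stay untouched.
    edited = features.copy()
    edited[..., selected] *= gamma

    # Decode: map the edited features back to the activation space.
    return edited @ W_dec + b_dec


# Toy example with identity weights, so each feature equals one coordinate.
eye = np.eye(4)
zeros = np.zeros(4)
activation = np.array([1.0, 2.0, 0.0, 3.0])

# Feature 1 is treated as the "concept" feature and scaled by gamma = -1.
ablated = ablate_features(activation, eye, zeros, eye, zeros, [1], -1.0)
# Coordinate 1 flips from 2.0 to -2.0; every other coordinate is unchanged.
```

With identity weights the effect is easy to read off: only the selected coordinate changes sign, which mirrors the caption's point that unselected features are left unchanged. In a real model the function would be applied to cross-attention activations inside the U-Net at each denoising step, which this sketch does not attempt to reproduce.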