SAEmnesia: Erasing Concepts in Diffusion Models with Supervised Sparse Autoencoders

Enrico Cassano, Riccardo Renzulli, Marco Nurisso, Mirko Zaffaroni, Alan Perotti, Marco Grangetto

TL;DR

SAEmnesia tackles concept unlearning in diffusion models by enforcing one-to-one concept–latent mappings through a supervised sparse autoencoder, addressing feature splitting and polysemanticity. It binds each concept to a dedicated latent, which enables targeted erasure that preserves unrelated content and substantially reduces hyperparameter search. On UnlearnCanvas, it achieves state-of-the-art unlearning performance, scales to sequential object removal, and remains robust to adversarial prompts while removing NSFW content. The work provides a principled, interpretable framework for controllable concept manipulation in generative models, with practical safety implications.

Abstract

Concept unlearning in diffusion models is hampered by feature splitting, where concepts are distributed across many latent features, making their removal challenging and computationally expensive. We introduce SAEmnesia, a supervised sparse autoencoder framework that overcomes this by enforcing one-to-one concept-neuron mappings. By systematically labeling concepts during training, our method achieves feature centralization, binding each concept to a single, interpretable neuron. This enables highly targeted and efficient concept erasure. SAEmnesia reduces hyperparameter search by 96.7% and achieves a 9.2% improvement over the state-of-the-art on the UnlearnCanvas benchmark. Our method also demonstrates superior scalability in sequential unlearning, improving accuracy by 28.4% when removing nine objects, establishing a new standard for precise and controllable concept erasure. Moreover, SAEmnesia mitigates the possibility of generating unwanted content under adversarial attack and effectively removes nudity when evaluated with I2P.
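
The idea of binding each labeled concept to one latent can be illustrated with a small sketch. The abstract does not give the loss, so everything below is an assumption made for illustration: the softmax cross-entropy form, the function names `supervised_binding_loss` and `total_loss`, the `weight` parameter, and the toy latent indices are hypothetical and are not taken from the paper.

```python
import math

def supervised_binding_loss(latents, concept, concept_to_latent):
    """Hypothetical supervised term (an assumption, not the paper's loss).

    Softmax cross-entropy restricted to the latents assigned to labeled
    concepts: it is small when the latent assigned to `concept` has the
    largest activation among all assigned latents.
    """
    assigned = sorted(concept_to_latent.values())
    logits = [latents[i] for i in assigned]
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_norm - latents[concept_to_latent[concept]]

def total_loss(reconstruction_error, latents, concept, concept_to_latent, weight=1.0):
    """Reconstruction error plus the weighted supervised term (illustrative)."""
    penalty = supervised_binding_loss(latents, concept, concept_to_latent)
    return reconstruction_error + weight * penalty

# Toy example: latent 2 is assigned to "Flowers", latent 0 to "Bears".
concept_to_latent = {"Bears": 0, "Flowers": 2}
latents = [0.1, 0.0, 3.0, 0.2]

well_bound = supervised_binding_loss(latents, "Flowers", concept_to_latent)
mis_bound = supervised_binding_loss(latents, "Bears", concept_to_latent)
assert well_bound < mis_bound  # the active latent matches "Flowers", not "Bears"
```

Under this assumed objective, an input labeled "Flowers" is penalized unless its assigned latent dominates the other concept latents, which is one way to encourage the feature centralization the abstract describes.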

Paper Structure

This paper contains 21 sections, 11 equations, 18 figures, 9 tables.

Figures (18)

  • Figure 1: SAEmnesia enables precise concept-level manipulation: each latent activates for a single concept (monosemanticity), and each concept is embedded in a single latent (feature centralization). So, to erase a target concept, we only need to steer a single latent. The removed concepts correctly disappear in the diagonal images ("Architectures", "Bears", "Sandwiches") while the corresponding style is preserved. Note that they remain present in the non-diagonal ones, thereby preserving the fidelity and diversity when unlearning unrelated content.
  • Figure 2: SAEmnesia pipeline. Training comprises two phases: (i) establishing sparse representations via standard unsupervised SAE training, (ii) applying supervised losses to strengthen specific concept-neuron associations. During inference, we need to steer a single latent per concept.
  • Figure 3: Effect of uniform multipliers on unlearning performance for SAeUron and SAEmnesia. SAEmnesia maintains higher and more stable performance across all multipliers compared to SAeUron, indicating greater robustness to the steering strength.
  • Figure 4: Feature importance score distributions for the "Flowers" concept. SAeUron shows dispersed, low-magnitude scores across all neurons with a maximum of 0.0166. SAEmnesia shows a clear dominant peak at neuron 11979 with a maximum score of 0.0404 (2.43$\times$ improvement).
  • Figure 5: K-NN classification across denoising timesteps, averaged over 20 object concepts. Using only the top-scoring latent identified by SAEmnesia, performance is similar to using all features, demonstrating that supervised training successfully concentrates concept-relevant information into single interpretable latents across diverse object categories.
  • ...and 13 more figures
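
The captions for Figures 1 to 3 describe erasing a concept by steering a single latent with a multiplier. A minimal sketch of what such a steering step could look like is given below. The exact rule is not stated in this excerpt, so the update (rescaling one latent and adding the resulting change along its decoder direction), the function name `steer_single_latent`, and the toy values are assumptions for illustration only.

```python
def steer_single_latent(activation, latents, decoder_rows, latent_idx, multiplier):
    """Rescale one SAE latent's contribution to an activation vector.

    activation   : model activation at the hooked layer (list of floats)
    latents      : SAE latent activations computed for `activation`
    decoder_rows : decoder_rows[i] is the decoder direction of latent i
    latent_idx   : index of the latent bound to the target concept
    multiplier   : steering strength (1.0 leaves the activation unchanged)

    Assumed rule (not confirmed by the source): only the change in the chosen
    latent is decoded and added, so all other latents and the SAE's
    reconstruction residual are left untouched.
    """
    delta = (multiplier - 1.0) * latents[latent_idx]
    direction = decoder_rows[latent_idx]
    return [a + delta * d for a, d in zip(activation, direction)]

# Toy example with a 2-dimensional activation and two latents.
activation = [1.0, 2.0]
latents = [0.5, 2.0]
decoder_rows = [[1.0, 0.0], [0.0, 1.0]]

# Multiplier 0.0 removes latent 1's contribution; multiplier 1.0 is a no-op.
assert steer_single_latent(activation, latents, decoder_rows, 1, 0.0) == [1.0, 0.0]
assert steer_single_latent(activation, latents, decoder_rows, 1, 1.0) == [1.0, 2.0]
```

With a one-to-one concept-latent mapping, only `latent_idx` and `multiplier` need to be chosen per concept, which is consistent with the reduced hyperparameter search reported in the abstract.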