SAEmnesia: Erasing Concepts in Diffusion Models with Supervised Sparse Autoencoders

Enrico Cassano; Riccardo Renzulli; Marco Nurisso; Mirko Zaffaroni; Alan Perotti; Marco Grangetto

TL;DR

SAEmnesia tackles concept unlearning in diffusion models by enforcing one-to-one concept–latent mappings through a supervised sparse autoencoder, addressing feature splitting and polysemanticity. Because each concept is bound to a dedicated latent, it can be erased in a targeted way that preserves unrelated content, and the hyperparameter search shrinks by 96.7%. On UnlearnCanvas, it achieves state-of-the-art unlearning performance, scales to sequential object removal, resists adversarial prompts, and removes NSFW content. The work provides a principled, interpretable framework for controllable concept manipulation in generative models, with practical safety implications.

Abstract

Concept unlearning in diffusion models is hampered by feature splitting, where concepts are distributed across many latent features, making their removal challenging and computationally expensive. We introduce SAEmnesia, a supervised sparse autoencoder framework that overcomes this by enforcing one-to-one concept-neuron mappings. By systematically labeling concepts during training, our method achieves feature centralization, binding each concept to a single, interpretable neuron. This enables highly targeted and efficient concept erasure. SAEmnesia reduces hyperparameter search by 96.7% and achieves a 9.2% improvement over the state-of-the-art on the UnlearnCanvas benchmark. Our method also demonstrates superior scalability in sequential unlearning, improving accuracy by 28.4% when removing nine objects, establishing a new standard for precise and controllable concept erasure. Moreover, SAEmnesia reduces the likelihood of generating unwanted content under adversarial attacks and effectively removes nudity when evaluated on I2P.
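To make the idea of a one-to-one concept-neuron binding concrete, here is a minimal, illustrative sketch in plain Python. It is not the authors' implementation. The abstract does not specify the training objective, so the loss below is an assumption: reconstruction error plus an L1 sparsity term plus a cross-entropy term that treats the latents as logits, so that latent `i` is pushed to fire for concept `i`. The function names, the weights `l1` and `beta`, and the `scale` parameter are all hypothetical.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def encode(x, W_enc, b_enc):
    """ReLU encoder: one non-negative activation per latent."""
    return [max(0.0, h + b) for h, b in zip(matvec(W_enc, x), b_enc)]

def decode(z, W_dec):
    """Linear decoder back to the input space."""
    return matvec(W_dec, z)

def supervised_sae_loss(x, label, W_enc, b_enc, W_dec, l1=0.01, beta=1.0):
    """Assumed objective: reconstruction + sparsity + concept supervision.

    The supervision term is a softmax cross-entropy over the latents,
    which rewards latent `label` for dominating when concept `label`
    is present. This form is an assumption made for illustration.
    """
    z = encode(x, W_enc, b_enc)
    x_hat = decode(z, W_dec)
    recon = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    sparsity = sum(abs(v) for v in z)
    m = max(z)  # subtract the max for a numerically stable log-sum-exp
    log_norm = m + math.log(sum(math.exp(v - m) for v in z))
    cross_entropy = log_norm - z[label]
    return recon + l1 * sparsity + beta * cross_entropy

def erase_concept(x, concept, W_enc, b_enc, W_dec, scale=0.0):
    """Erase a concept by rescaling its single dedicated latent.

    With scale=0.0 the latent is simply ablated; other latents are
    left untouched, which is what keeps unrelated content intact.
    """
    z = encode(x, W_enc, b_enc)
    z[concept] *= scale
    return decode(z, W_dec)

# Toy usage: identity weights, so latent i mirrors input dimension i.
identity = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
activation = [1.0, 2.0]
erased = erase_concept(activation, 1, identity, bias, identity)
```

In this toy setup, erasing concept 1 zeroes only the second component of the reconstruction. The point of the sketch is that, once a concept lives in a single latent, erasure needs one index and one scale value rather than a search over many candidate features.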
