SAEs $\textit{Can}$ Improve Unlearning: Dynamic Sparse Autoencoder Guardrails for Precision Unlearning in LLMs

Aashiq Muhamed, Jacopo Bonato, Mona Diab, Virginia Smith

TL;DR

This work tackles the problem of unlearning specific knowledge from large language models, highlighting the shortcomings of gradient-based approaches in efficiency, stability, sequential forgetting, relearning resilience, data efficiency, and interpretability. It introduces Dynamic SAE Guardrails (DSG), which uses Sparse Autoencoders (SAEs) with Fisher Information-guided feature selection to identify causal mediators, together with a dynamic, input-dependent classifier that conditionally clamps these features during inference. DSG achieves superior forget-utility trade-offs across benchmarks (WMDP, MUSE), with demonstrated gains in computational efficiency (forward passes only), robustness to sequential unlearning, and resilience against relearning attacks, while providing interpretable, zero-shot capable feature explanations via SAE activations. The approach shows strong practical potential for safe AI deployment, enabling precise, efficient, and interpretable unlearning in production settings, including zero-shot domains and data-scarce scenarios.
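
The conditional clamping described above can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names, the choice to score an input by the fraction of tokens on which any selected feature fires, and the choice to clamp the selected features to `-clamp_strength` on every token once the input is flagged.

```python
import numpy as np

def forget_token_fraction(acts, selected, eps=0.0):
    """Fraction of tokens on which at least one selected SAE feature fires.

    acts: (num_tokens, num_features) array of SAE feature activations.
    selected: indices of features chosen as forget-relevant (hypothetical input).
    """
    fired = (acts[:, selected] > eps).any(axis=1)
    return float(fired.mean())

def dynamic_clamp(acts, selected, threshold, clamp_strength):
    """Clamp the selected features only when the input looks forget-relevant.

    Assumption: a flagged input has its selected features set to
    -clamp_strength on every token; unflagged inputs pass through unchanged.
    """
    rho = forget_token_fraction(acts, selected)
    if rho <= threshold:
        return acts, rho
    clamped = acts.copy()
    clamped[:, selected] = -clamp_strength
    return clamped, rho

# Toy input: 4 tokens, 3 SAE features; feature 1 is the selected one.
acts = np.array([[0.0, 2.0, 0.0],
                 [0.0, 1.5, 0.3],
                 [0.1, 0.0, 0.0],
                 [0.0, 0.0, 0.0]])
clamped, rho = dynamic_clamp(acts, selected=[1], threshold=0.25, clamp_strength=10.0)
print(rho)
print(clamped)
```

Because the decision is made per input, benign prompts that never trigger the selected features are left untouched, which is one plausible reading of why a dynamic scheme can preserve utility better than always-on clamping.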

Abstract

Machine unlearning is a promising approach to improve LLM safety by removing unwanted knowledge from the model. However, prevailing gradient-based unlearning methods suffer from issues such as high computational costs, hyperparameter instability, poor sequential unlearning capability, vulnerability to relearning attacks, low data efficiency, and lack of interpretability. While Sparse Autoencoders are well-suited to improve these aspects by enabling targeted activation-based unlearning, prior approaches underperform gradient-based methods. This work demonstrates that, contrary to these earlier findings, SAEs can significantly improve unlearning when employed dynamically. We introduce $\textbf{Dynamic SAE Guardrails}$ (DSG), a novel method for precision unlearning that leverages principled feature selection and a dynamic classifier. Our experiments show DSG substantially outperforms leading unlearning methods, achieving superior forget-utility trade-offs. DSG addresses key drawbacks of gradient-based approaches for unlearning -- offering enhanced computational efficiency and stability, robust performance in sequential unlearning, stronger resistance to relearning attacks, better data efficiency including zero-shot settings, and more interpretable unlearning.

Paper Structure

This paper contains 108 sections, 7 theorems, 34 equations, 13 figures, 16 tables, 1 algorithm.

Key Result

Theorem 1

For an SAE with small reconstruction error and input $\mathbf{h}$, the expected squared gradient of reconstruction loss with respect to feature $j$'s decoder weights $\boldsymbol{\theta}_{j,\cdot}$ is proportional to the second moment of that feature's activation: ${ \mathbb{E}_{\mathbf{h}} [ \|\nab
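
The proportionality stated in Theorem 1 suggests that a feature's second moment of activation can stand in for a gradient-based Fisher score, so features can be ranked with forward passes only. The sketch below is a hedged illustration of that idea, not the paper's selection rule: the function names are hypothetical, and scoring by the ratio of forget-set to retain-set second moments is an assumption introduced here to favour features that are specific to the forget data.

```python
import numpy as np

def second_moment(acts):
    """Per-feature mean squared activation over tokens.

    acts: (num_tokens, num_features) array of SAE feature activations.
    """
    return (acts ** 2).mean(axis=0)

def select_features(forget_acts, retain_acts, k, eps=1e-8):
    """Return indices of the k features with the highest forget/retain score.

    Assumption: the score is the ratio of second moments on forget data to
    second moments on retain data; eps guards against division by zero.
    """
    score = second_moment(forget_acts) / (second_moment(retain_acts) + eps)
    return np.argsort(score)[::-1][:k]

# Toy activations: feature 2 is strong on forget data and weak on retain data.
forget_acts = np.array([[0.0, 1.0, 3.0],
                        [0.0, 1.0, 3.0]])
retain_acts = np.array([[0.0, 1.0, 0.1],
                        [0.0, 1.0, 0.1]])
print(select_features(forget_acts, retain_acts, k=1))
```

No backward pass is involved: the only inputs are activations collected while running the model on the two datasets.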

Figures (13)

  • Figure 1: An illustration of DSG
  • Figure 2: Distribution of $\rho(x)$ for unlearning on WMDP-Bio. Threshold at 95th percentile (dashed red line) separates MMLU from WMDP.
  • Figure 3: Unlearning performance on WMDP-Bio (left) and WMDP-Cyber (right). Higher MMLU accuracy and lower WMDP accuracy are better. Clamp strengths ($c$) used for DSG points are shown as annotations. DSG Pareto-dominates the top four baseline methods (RMU, SCRUB, Farrell et al., SSD).
  • Figure 4: (a) Scalability: Performance across increasing forget set sizes. (b) Sequential Unlearning: Performance across sequential unlearning requests
  • Figure 5: Relearning attack resistance across finetuning epochs. (a) DSG demonstrates superior resistance to relearning compared to RMU. (b) Test-time DSG preserves MMLU utility better than Train-time DSG while still providing significant protection.
  • ...and 8 more figures
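
The percentile threshold mentioned in the Figure 2 caption can be sketched as a simple calibration step. The caption does not say which distribution the 95th percentile is taken over; the sketch assumes it is computed on $\rho(x)$ scores from retain-side (MMLU-like) inputs, so that roughly 5% of benign inputs would be flagged. The function names and the toy scores are hypothetical.

```python
import numpy as np

def calibrate_threshold(retain_rhos, percentile=95.0):
    """Pick the threshold as a percentile of rho scores on retain data.

    Assumption: calibration uses benign (retain) inputs only.
    """
    return float(np.percentile(retain_rhos, percentile))

def is_forget_input(rho, threshold):
    """Flag an input as forget-relevant when its score exceeds the threshold."""
    return rho > threshold

# Toy retain-side scores spread evenly between 0 and 0.2.
retain_rhos = np.linspace(0.0, 0.2, 21)
threshold = calibrate_threshold(retain_rhos)
print(threshold)
print(is_forget_input(0.5, threshold), is_forget_input(0.05, threshold))
```

Calibrating on retain data alone means the threshold can be set without any labelled forget examples, at the cost of a fixed false-positive rate on benign inputs.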

Theorems & Definitions (12)

  • Theorem 1: Fisher Information Approximation
  • Theorem 2: Fisher Information as a Proxy for Causal Feature Importance
  • Definition 1: Forget-Set Activated Token
  • Theorem 3: Neyman-Pearson Optimality
  • Theorem 1: Approximate Fisher Information from SAE Features
  • proof
  • Theorem 2: Fisher Information as a Proxy for Causal Feature Importance
  • proof
  • Theorem 3: Neyman-Pearson Optimality
  • proof
  • ...and 2 more