Don't Forget It! Conditional Sparse Autoencoder Clamping Works for Unlearning

Matthew Khoriaty, Andrii Shportko, Gustavo Mercier, Zach Wood-Doughty

TL;DR

This work advances SAE-based unlearning for LLM safety by introducing alignment-driven conditional steering, which suppresses harmful knowledge while preserving benign capabilities. It compares RMU against two SAE-based clamping methods, Clamp Prime and Refusal Clamp, on the gemma-2-2b model, using the WMDP-Bio forget/retain framework together with an MMLU evaluation of retained capability. Refusal Clamp achieves the highest alignment, with notable forgetting at an acceptable retention cost. A novel alignment metric combines improved suppression of harmful content with retention of safe knowledge, and a compact Steering CSV representation enables efficient communication of SAE edits. Overall, the approach provides interpretable, data-driven control over internal representations, with practical implications for safer AI deployment, though the authors acknowledge its computational cost and the need for broader adversarial testing.

Abstract

Recent developments in Large Language Model (LLM) capabilities have brought great potential but also posed new risks. For example, LLMs with knowledge of bioweapons, advanced chemistry, or cyberattacks could cause violence if they fall into the wrong hands or malfunction. Because LLMs are near-black boxes, intuitive interpretation of their internals remains an open research question, which prevents developers from easily controlling model behavior and capabilities. Sparse Autoencoders (SAEs) have recently emerged as a potential method for unraveling the representations of concepts in LLM internals, and they allow developers to steer model outputs by directly modifying the hidden activations. In this paper, we use SAEs to identify unwanted concepts from the Weapons of Mass Destruction Proxy (WMDP) dataset within gemma-2-2b internals, and we use feature steering to reduce the model's ability to answer harmful questions while retaining its performance on harmless queries. Our results restore optimism about the viability of SAE-based explicit knowledge unlearning techniques.

Paper Structure

This paper contains 31 sections, 2 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Effect of the number of top-k selected features on Alignment, with all other hyperparameters held equal
  • Figure 2: Refusal Clamp algorithm
  • Figure 3: Pareto frontiers for the 3 procedures, along with the top-performing models. The isolines mark equal Alignment score.
  • Figure 4: Average probability that a latent is non-zero vs. the frequency of such latents in layer $7$
  • Figure 5: "Clamp Prime" algorithm
  • ...and 1 more figures