Mitigating Memorization in LLMs using Activation Steering

Manan Suri, Nishit Anand, Amisha Bhaskar

TL;DR

The paper investigates activation steering, an intervention applied to activations during the forward pass, as a way to mitigate memorization in LLMs. Steering vectors derived from Sparse Autoencoders modify layer activations as $\mathbf{a}_{\text{steered}} = \mathbf{a} + \alpha \cdot \beta \cdot \mathbf{v}_i$. Across a 40-book memorization benchmark and multiple tasks, higher steering strength and steering at later layers reduce memorization while causing only modest degradation in language modeling and general abilities; the best trade-off falls around $50 < |\beta| < 100$, and Layer 31 shows strong robustness. The study highlights the importance of layer selection and of a controlled semantic footprint, offering a practical alternative to data sanitization or retraining for safer LLM deployment. It also provides a framework for systematic exploration of activation-based interventions, including guidance on best practices and on potential limitations such as dataset bias and entanglement between memorization and general knowledge.
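The steering update in the TL;DR can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `steer_activations`, the array shapes, and the treatment of both $\alpha$ and $\beta$ as plain scalar multipliers are assumptions made for illustration, since this summary does not define their individual roles. The vector `v` stands in for a Sparse Autoencoder-derived direction $\mathbf{v}_i$.

```python
import numpy as np


def steer_activations(a, v, alpha, beta):
    """Return a + alpha * beta * v, applied at every token position.

    Assumptions (not taken from the paper):
      a     : activations of shape (seq_len, d_model)
      v     : steering direction of shape (d_model,)
      alpha : scalar multiplier
      beta  : scalar steering strength
    """
    a = np.asarray(a, dtype=float)
    v = np.asarray(v, dtype=float)
    if v.shape[-1] != a.shape[-1]:
        raise ValueError("steering vector must match the hidden dimension")
    # Broadcasting adds the same scaled vector to each row (token) of a.
    return a + alpha * beta * v


# Toy example: 3 token positions, hidden size 4, steering along the first axis.
a = np.zeros((3, 4))
v = np.array([1.0, 0.0, 0.0, 0.0])
steered = steer_activations(a, v, alpha=1.0, beta=-75.0)
```

In a real model this update would typically be applied to one chosen layer's output during the forward pass; the toy value of `beta` is picked only so that its magnitude falls inside the $50 < |\beta| < 100$ range mentioned above.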

Abstract

The memorization of training data by Large Language Models (LLMs) poses significant risks, including privacy leaks and the regurgitation of copyrighted content. Activation steering, a technique that directly intervenes in model activations, has emerged as a promising approach for manipulating LLMs. In this work, we explore the effectiveness of activation steering in reducing memorization while preserving generalization capabilities. We conduct empirical evaluations using a controlled memorization benchmark of literary material and demonstrate that our method successfully suppresses memorized content in Gemma with minimal degradation in model performance. Additionally, we analyze the trade-offs between suppression effectiveness and linguistic fluency, highlighting the advantages and limitations of activation-based interventions. Our findings contribute to ongoing efforts in developing safer and more privacy-preserving LLMs by providing a practical and efficient mechanism to mitigate unintended memorization.

Paper Structure

This paper contains 21 sections, 1 equation, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Evaluation of memorization, measured as ANLCS vs. steering strength. For the default models, the strength shown is that of the parallel steered run; it is plotted only to show the spread of performance and is not related to the default models' actual performance.
  • Figure 2: Ratio of steered vs default setting's perplexity, with varying strength.
  • Figure 3: Comparison of default setting (average score across runs represented by - -) with steered models at different layers of intervention on LLM performance benchmarks, Score vs Steering Strength.
  • Figure 4: Qualitative examples from different layers, at different steering strengths. Examples are evaluated based on their ability to mitigate memorization, as well as language modeling. All systems have been prompted with the Harry Potter prompt.
  • Figure 5: Qualitative examples demonstrating high semantic footprint. All systems have been prompted with the Harry Potter prompt.