Mitigating Memorization in LLMs using Activation Steering
Manan Suri, Nishit Anand, Amisha Bhaskar
TL;DR
The paper investigates activation steering, an intervention applied to activations during the forward pass, as a way to mitigate memorization in LLMs. Steering vectors derived from a Sparse Autoencoder modify a layer's activations as $\mathbf{a}_{\text{steered}} = \mathbf{a} + \alpha \cdot \beta \cdot \mathbf{v}_i$. Across a 40-book memorization benchmark and multiple tasks, higher steering strength and steering of later layers reduce memorization while incurring only modest degradation in language modeling and general abilities; the best regime is around $50 < |\beta| < 100$, and Layer 31 shows strong robustness. The study highlights the importance of layer selection and of keeping the steering vector's semantic footprint controlled, and it offers a practical alternative to data sanitization or retraining for safer LLM deployment. It also provides a framework for systematically exploring activation-based interventions, including guidance on best practices and on potential limitations such as dataset bias and entanglement between memorization and general knowledge.
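As a rough illustration of the steering update quoted in the TL;DR, the sketch below applies $\mathbf{a} + \alpha \cdot \beta \cdot \mathbf{v}_i$ to a single activation vector. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `steer_activation` is hypothetical, $\alpha$ and $\beta$ are assumed to be plain scalars (the summary does not define them separately), and the vectors are toy Python lists rather than real model activations or Sparse Autoencoder directions.

```python
from typing import List


def steer_activation(a: List[float], v: List[float],
                     alpha: float, beta: float) -> List[float]:
    """Return a + alpha * beta * v, computed element-wise.

    Hypothetical helper: `a` stands in for a layer activation and `v`
    for a steering vector; `alpha` and `beta` are assumed to be scalars.
    """
    if len(a) != len(v):
        raise ValueError("activation and steering vector must have the same length")
    scale = alpha * beta  # combined scalar applied to the steering direction
    return [a_j + scale * v_j for a_j, v_j in zip(a, v)]


# Toy values chosen only for illustration (not taken from the paper).
a = [0.5, -1.0, 2.0]
v = [1.0, 0.0, -1.0]

steered = steer_activation(a, v, alpha=1.0, beta=-2.0)
print(steered)  # [-1.5, -1.0, 4.0]
```

With `beta = 0` the activation is returned unchanged, which makes it easy to see that the steering strength alone controls how far the activation moves along `v`. In a real model the same update would be applied to a chosen layer's activations during the forward pass.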
Abstract
The memorization of training data by Large Language Models (LLMs) poses significant risks, including privacy leaks and the regurgitation of copyrighted content. Activation steering, a technique that directly intervenes in model activations, has emerged as a promising approach for manipulating LLMs. In this work, we explore the effectiveness of activation steering in reducing memorization while preserving generalization capabilities. We conduct empirical evaluations on Gemma using a controlled memorization benchmark of literary material and demonstrate that our method successfully suppresses memorized content with minimal degradation in model performance. Additionally, we analyze the trade-offs between suppression effectiveness and linguistic fluency, highlighting the advantages and limitations of activation-based interventions. Our findings contribute to ongoing efforts to develop safer and more privacy-preserving LLMs by providing a practical and efficient mechanism to mitigate unintended memorization.
