Precise In-Parameter Concept Erasure in Large Language Models
Yoav Gur-Arieh, Clara Suslik, Yihuai Hong, Fazl Barez, Mor Geva
TL;DR
The paper addresses the challenge of removing undesirable knowledge from large language models without sacrificing utility. It introduces PISCES, a framework that uses a sparse autoencoder to disentangle MLP parameter vectors into interpretable features, identifies the features related to a target concept via vocabulary projection, and edits the corresponding MLP vectors to suppress the concept. Empirical results on Gemma 2 and Llama 3.1 show that PISCES achieves competitive efficacy while substantially improving specificity and robustness to relearning, indicating more durable and precise concept erasure than baselines. The work demonstrates the feasibility and benefits of in-parameter editing for the safe deployment of LLMs, and discusses limitations and directions for future work, including extending the approach to other parameter components and exploring supervised disentanglement.
Abstract
Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 38%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.
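To make the pipeline described above more concrete, here is a minimal NumPy sketch of its three steps on toy data: decompose an MLP vector into features, pick out concept features by projecting their directions onto the vocabulary, and remove those features from the vector. It is an illustration under stated assumptions, not the authors' implementation. The tied-weight ReLU disentangler, the top-k token-overlap rule, the subtraction-based edit, the function names (top_tokens, concept_features, erase), and all numeric values are hypothetical choices made for this example.

```python
import numpy as np

# Toy vocabulary and unembedding matrix (vocab_size x d_model). Values are made up.
vocab = ["wand", "wizard", "owl", "tax", "law", "court"]
unembed = np.array([
    [1.0, 0.0, 0.0, 0.0],  # wand
    [0.9, 0.0, 0.0, 0.0],  # wizard
    [0.8, 0.0, 0.5, 0.0],  # owl
    [0.0, 1.0, 0.4, 0.0],  # tax
    [0.0, 0.9, 0.3, 0.0],  # law
    [0.0, 0.8, 0.2, 0.0],  # court
])

# Assumed disentangler: 3 features in a 4-dim space, tied encoder/decoder weights.
decoder = np.eye(4)[:3]   # (n_features, d_model); row j is feature j's direction
encoder = decoder.copy()  # assumption: tied weights, no bias terms


def top_tokens(direction, k=3):
    """Vocabulary projection: the k tokens a direction promotes most."""
    logits = unembed @ direction
    return [vocab[i] for i in np.argsort(-logits)[:k]]


def concept_features(concept_tokens, k=3, min_hits=2):
    """Features whose top-k projected tokens overlap the concept token set."""
    return [
        j for j, direction in enumerate(decoder)
        if len(set(top_tokens(direction, k)) & concept_tokens) >= min_hits
    ]


def erase(mlp_vector, feature_ids):
    """Subtract the selected features' contributions from an MLP vector."""
    acts = np.maximum(encoder @ mlp_vector, 0.0)  # ReLU feature activations
    edited = mlp_vector.copy()
    for j in feature_ids:
        edited -= acts[j] * decoder[j]
    return edited


mlp_vector = np.array([2.0, 1.0, 0.5, 0.0])  # hypothetical MLP output vector
features = concept_features({"wand", "wizard", "owl"})
edited = erase(mlp_vector, features)

print("concept features:", features)
print("'wand' logit before/after:", unembed[0] @ mlp_vector, unembed[0] @ edited)
```

In this toy setup, only the feature whose projection is dominated by the concept tokens is selected, and removing it zeroes the edited vector's component along that direction while leaving the other components unchanged. A real disentangler is learned and its feature directions are not orthogonal, so the edit would be less clean than it is here.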
