Precise In-Parameter Concept Erasure in Large Language Models

Yoav Gur-Arieh, Clara Suslik, Yihuai Hong, Fazl Barez, Mor Geva

TL;DR

The paper addresses the challenge of removing undesirable knowledge from large language models without sacrificing utility. It introduces PISCES, a framework that disentangles MLP parameter directions encoding a target concept using a sparse autoencoder, then identifies concept-related features via vocabulary projection and edits the corresponding MLP vectors to suppress the concept. Empirical results on Gemma-2 and Llama-3.1 show that PISCES achieves competitive efficacy while substantially improving specificity and robustness to relearning, indicating more durable and precise concept erasure than baselines. The work demonstrates the feasibility and benefits of in-parameter editing for safe deployment of LLMs, and discusses limitations and avenues for future improvements, including extending to other parameter components and exploring supervised disentanglement.

Abstract

Large language models (LLMs) often acquire knowledge during pretraining that is undesirable in downstream deployments, e.g., sensitive information or copyrighted content. Existing approaches for removing such knowledge rely on fine-tuning, training low-rank adapters or fact-level editing, but these are either too coarse, too shallow, or ineffective. In this work, we propose PISCES (Precise In-parameter Suppression for Concept EraSure), a novel framework for precisely erasing entire concepts from model parameters by directly editing directions that encode them in parameter space. PISCES uses a disentangler model to decompose MLP vectors into interpretable features, identifies those associated with a target concept using automated interpretability techniques, and removes them from model parameters. Experiments on Gemma 2 and Llama 3.1 over various concepts show that PISCES achieves modest gains in efficacy over leading erasure methods, reducing accuracy on the target concept to as low as 7.7%, while dramatically improving erasure specificity (by up to 31%) and robustness (by up to 38%). Overall, these results demonstrate that feature-based in-parameter editing enables a more precise and reliable approach for removing conceptual knowledge in language models.
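The disentangle, identify, edit, and reconstruct steps described in the abstract can be illustrated with a minimal toy sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the tied-weight ReLU sparse autoencoder, the toy unembedding matrix `W_unembed`, the helper names, the `top_k`/`min_hits` selection rule, and the choice to erase by subtracting the targeted features' contribution (the paper's actual editing rule may differ).

```python
import numpy as np

def encode(v, W_enc, b_enc):
    """Feature activations of a toy ReLU sparse autoencoder for vector v."""
    return np.maximum(W_enc @ v + b_enc, 0.0)

def concept_features(W_dec, W_unembed, concept_token_ids, top_k=2, min_hits=2):
    """Pick features whose top-k vocabulary-projected tokens include
    at least `min_hits` concept tokens (hypothetical selection rule)."""
    logits = W_dec @ W_unembed.T                   # (n_features, vocab)
    top = np.argsort(-logits, axis=1)[:, :top_k]   # top-k token ids per feature
    hits = np.isin(top, concept_token_ids).sum(axis=1)
    return np.flatnonzero(hits >= min_hits)

def erase(v, W_enc, b_enc, W_dec, feature_ids):
    """Remove the targeted features' contribution from v.
    Subtracting only those decoder directions leaves the part of v
    that the autoencoder does not reconstruct untouched."""
    acts = encode(v, W_enc, b_enc)
    return v - acts[feature_ids] @ W_dec[feature_ids]

# Toy setup: 3 features in a 4-dimensional space, 5-token vocabulary.
W_dec = np.eye(3, 4)          # one decoder direction per feature
W_enc = W_dec.copy()          # tied weights (assumption)
b_enc = np.zeros(3)
W_unembed = np.array([
    [2.0, 0.0, 0.0, 0.0],     # token 0 (concept token)
    [1.0, 0.0, 0.0, 0.0],     # token 1 (concept token)
    [0.0, 1.0, 0.0, 0.0],     # token 2
    [0.0, 0.0, 1.0, 0.0],     # token 3
    [0.0, 0.5, 0.5, 0.0],     # token 4
])

ids = concept_features(W_dec, W_unembed, concept_token_ids=[0, 1])
v = np.array([3.0, 1.0, 0.0, 0.5])   # stand-in for one MLP vector
v_new = erase(v, W_enc, b_enc, W_dec, ids)
```

In this toy setup only the first feature projects onto the concept tokens, so only its direction is removed from `v`; the remaining components, including the one the autoencoder does not capture, are left unchanged.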

Paper Structure

This paper contains 49 sections, 6 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: $\mathtt{PISCES}$ disentangles model parameters to identify those encoding a target concept (e.g. Harry Potter). It then edits those disentangled parameters to precisely remove the target concept, before reconstructing them and finally replacing them in the model.
  • Figure 2: Sampled questions about erased concepts with responses generated by models after unlearning by $\mathtt{PISCES}$, ELM and RMU, as well as the baseline response. Erased concepts are Harry Potter and Gun. See the table of example answers in the appendix for more examples.
  • Figure 3: Illustration of $\mathtt{PISCES}$'s erasure process for the example concept Harry Potter. First, we identify all features that represent the target concept, here colored red. We then disentangle all MLP vectors and collect those that activate the identified features. Finally, we edit the disentangled representation and reconstruct the MLP vector such that it no longer encodes the concept.
  • Figure 4: Performance of PISCES, ELM and RMU (MEMIT and AlphaEdit are omitted due to poor performance) on four concepts in Gemma-2-2b-it and Llama-3.1-8b-it. Each point is a single hyperparameter selection taken out of 100 possible choices, presenting only the best performing ones. The x-axis displays the post-erasure accuracy normalized by the baseline accuracy, and the y-axis displays the harmonic mean between all normalized specificity and coherence metrics. The star represents the goal -- zero accuracy and 100% specificity and coherence.
  • Figure 5: Analysis showing the relationships between feature alignment and erasure accuracy (left, $-0.72$ correlation with p-value $0.01$), and between the number of selected features and MMLU performance (right, $-0.64$ correlation with p-value $0.03$).
  • ...and 4 more figures
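
The two axes described in the Figure 4 caption can be computed as in the following sketch. The function name and all numbers are hypothetical and chosen only to show the arithmetic; they are not results from the paper, and the paper may apply additional processing (such as clipping) that is not shown here.

```python
from statistics import harmonic_mean

def figure4_point(acc_post, acc_base, metrics_post, metrics_base):
    """Return (x, y) for one hyperparameter setting.
    x: post-erasure accuracy normalized by the baseline accuracy.
    y: harmonic mean of the specificity and coherence metrics,
       each normalized by its baseline value."""
    x = acc_post / acc_base
    y = harmonic_mean([p / b for p, b in zip(metrics_post, metrics_base)])
    return x, y

# Illustrative values only: one specificity metric and one coherence metric.
x, y = figure4_point(
    acc_post=0.1, acc_base=0.8,
    metrics_post=[0.45, 0.9], metrics_base=[0.5, 0.9],
)
```

Because the harmonic mean is dominated by its smallest term, a setting that badly degrades any single specificity or coherence metric scores low on the y-axis even if the other metrics are preserved. The goal marked by the star corresponds to `x = 0` and `y = 1`.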