Erasing Conceptual Knowledge from Language Models

Rohit Gandikota, Sheridan Feucht, Samuel Marks, David Bau

TL;DR

This work reframes unlearning for large language models as concept-level editing guided by the model's own introspective judgments. It introduces Erasure of Language Memory (ELM), which uses a self-classification objective and low-rank adapters to reduce generation of erased concepts while preserving broader capabilities and fluency. Across biosecurity, cybersecurity, and literary domains, ELM-edited models score near random on assessments of the erased content and remain robust to adversarial prompts, outperforming prior methods on innocence, specificity, and seamlessness. The results suggest a practical approach to concept unlearning, supported by an evaluation framework and a public codebase for replication and extension.

Abstract

In this work, we introduce Erasure of Language Memory (ELM), a principled approach to concept-level unlearning that operates by matching distributions defined by the model's own introspective classification capabilities. Our key insight is that effective unlearning should leverage the model's ability to evaluate its own knowledge, using the language model itself as a classifier to identify and reduce the likelihood of generating content related to undesired concepts. ELM applies this framework to create targeted low-rank updates that reduce generation probabilities for concept-specific content while preserving the model's broader capabilities. We demonstrate ELM's efficacy on biosecurity, cybersecurity, and literary domain erasure tasks. Comparative evaluation reveals that ELM-modified models achieve near-random performance on assessments targeting erased concepts, while simultaneously preserving generation coherence, maintaining benchmark performance on unrelated tasks, and exhibiting strong robustness to adversarial attacks. Our code, data, and trained models are available at https://elm.baulab.info

Paper Structure

This paper contains 50 sections, 10 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustration of the core of our Erasure of Language Memory (ELM) approach. To calculate our $\mathcal{L}_{erase}$ loss term, we design document prefixes $c_-$ = "As an expert in bioweapons:" and $c_+$ = "As a novice in bioweapons:", which can be viewed as "class labels" that influence the model's output logits. For each document relating to the concept we want to erase, we obtain class-conditional logits for $c_+$, for $c_-$, and without any prefix. We then fine-tune our new erased model with parameters $\theta^*$ to match the ratio between these conditions (Equation \ref{eq:erase_prob}), using low-rank adapters (LoRA) over early layers to target factual knowledge. See Section \ref{sec:method} for further details.
  • Figure 2: When erasing WMDP concepts, we expect accuracy to remain high for unrelated (safe) MMLU concepts. ELM shows stronger specificity, with less of a decrease in accuracy after fine-tuning Zephyr-7B.
  • Figure 3: Analysis of post-erasure internal representations. (a) The first two plots show that ELM probing accuracies across layers in Zephyr-7B demonstrate near-random performance (dashed lines). (b) Activation norms show that ELM preserves typical model behavior for erased concepts in later layers, suggesting successful concept removal while maintaining broader model functionality.
  • Figure 4: Hyperparameter sweep results for rank, $\eta$, and layer selection
  • Figure 5: Evaluating intermediate checkpoints of the ELM method to observe training progression. We find that the model shows a sudden drop in knowledge and then continues to slowly remove the remaining traces.
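
The prefix-conditioned objective described in the Figure 1 caption can be illustrated with a small sketch. It assumes the target next-token distribution is the unprefixed distribution reweighted by the ratio of the "novice" ($c_+$) to "expert" ($c_-$) conditioned distributions, raised to a strength $\eta$, and then renormalized. That exact form, the function names, and the toy logits are assumptions made for illustration, not the authors' implementation.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - lse for v in logits]

def erase_target_logprobs(plain_logits, novice_logits, expert_logits, eta=1.0):
    """Target log-probs: plain distribution times (novice / expert) ** eta.

    Assumed form of the ratio-matching target; hypothetical helper.
    """
    lp_plain = log_softmax(plain_logits)
    lp_novice = log_softmax(novice_logits)  # logits after the c_+ prefix
    lp_expert = log_softmax(expert_logits)  # logits after the c_- prefix
    unnormalized = [
        p + eta * (n - e)
        for p, n, e in zip(lp_plain, lp_novice, lp_expert)
    ]
    return log_softmax(unnormalized)  # renormalize to a distribution

def cross_entropy(target_logprobs, student_logits):
    """Cross-entropy between the fixed target and the model being tuned."""
    student_lp = log_softmax(student_logits)
    return -sum(math.exp(t) * s for t, s in zip(target_logprobs, student_lp))

# Toy 3-token vocabulary; token 0 stands in for concept-specific content
# that the "expert" prefix makes more likely and the "novice" prefix less.
plain = [2.0, 1.0, 0.0]
novice = [1.0, 1.0, 0.0]
expert = [3.0, 1.0, 0.0]

target = erase_target_logprobs(plain, novice, expert, eta=1.0)
loss_before_tuning = cross_entropy(target, plain)
```

In this toy setup the target assigns token 0 less probability than the unprefixed model does, and setting `eta=0.0` recovers the unprefixed distribution unchanged. In an actual fine-tuning run, a loss of this kind would presumably be minimized with respect to the adapter parameters only, with the three sets of logits coming from the frozen original model.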