Collapse of Irrelevant Representations (CIR) Ensures Robust and Non-Disruptive LLM Unlearning

Filip Sondej, Yushi Yang

TL;DR

Current unlearning methods fail to remove dangerous knowledge without harming unrelated capabilities, because their updates target broad, shared representations. The authors introduce Collapse of Irrelevant Representations (CIR), which uses PCA to identify subspaces of common representations in activations and module-output gradients and collapses those subspaces before computing unlearning updates. CIR is combined with a representation-engineering loss so that updates are steered away from benign information. On the WMDP bio and cyber domains with Llama-3.1-8B, CIR achieves much higher post-attack robustness (roughly 30×–80× improvements) while causing far less disruption than the baselines. This representational selectivity enables robust, non-disruptive unlearning and suggests a path toward safer deployment of LLMs.

Abstract

Current unlearning and safety training methods consistently fail to remove dangerous knowledge from language models. We identify the root cause - unlearning targets representations which are too general - and develop a highly selective technique that unlearns robustly while preserving general performance. Our method performs PCA on activations and module-output gradients to identify subspaces containing common representations, then collapses these subspaces before computing unlearning updates, a technique we term Collapse of Irrelevant Representations (CIR). This avoids unlearning general knowledge and targets only representations specific to the facts being unlearned. When unlearning bio- and cyber-hazardous facts from Llama-3.1-8B, we achieve over 30x greater reduction in post-attack accuracy than the best baseline (Circuit Breakers), while disrupting general performance 30x less, and using less than 3 GPU-seconds per fact. Thus, by disentangling harmful and benign capabilities at the level of representations, CIR enables robust and non-disruptive unlearning.
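
The collapse step described in the abstract can be illustrated with a minimal sketch for a single linear module `y = W x`. This is not the authors' implementation: the function names, the choice to fit PCA on the same batch that is being unlearned, the inclusion of the mean direction in the collapsed subspace, and the number of components `k` are all assumptions made for illustration. The only fact relied on is that the weight gradient of a linear module is the sum over tokens of the outer product of the module-output gradient and the input activation.

```python
import numpy as np


def common_subspace(samples: np.ndarray, k: int) -> np.ndarray:
    """Orthonormal basis (as columns) for the mean plus top-k PCA directions.

    `samples` has shape (n_tokens, dim). Treating the mean as one more
    direction to collapse is an assumption of this sketch.
    """
    mean = samples.mean(axis=0)
    # Rows of vt are principal directions of the centered data,
    # ordered by decreasing explained variance.
    _, _, vt = np.linalg.svd(samples - mean, full_matrices=False)
    directions = np.vstack([mean, vt[:k]])   # shape (k + 1, dim)
    basis, _ = np.linalg.qr(directions.T)    # shape (dim, k + 1)
    return basis


def collapse(vectors: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Project each row of `vectors` onto the orthogonal complement of `basis`."""
    return vectors - (vectors @ basis) @ basis.T


def collapsed_update(acts: np.ndarray, out_grads: np.ndarray, k: int = 2) -> np.ndarray:
    """Weight-gradient estimate for y = W x after collapsing common subspaces.

    acts:      (n_tokens, d_in)  inputs to the module
    out_grads: (n_tokens, d_out) gradients w.r.t. the module output
    """
    acts_c = collapse(acts, common_subspace(acts, k))
    grads_c = collapse(out_grads, common_subspace(out_grads, k))
    # dL/dW = sum over tokens of outer(out_grad, act), shape (d_out, d_in).
    return grads_c.T @ acts_c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic data with a strong shared offset, standing in for
    # "common representations" that appear at every token.
    acts = rng.normal(size=(64, 16)) + 5.0 * rng.normal(size=16)
    grads = rng.normal(size=(64, 8)) + 3.0 * rng.normal(size=8)
    update = collapsed_update(acts, grads, k=2)
    print(update.shape)
```

In this sketch the resulting update has no component along the collapsed activation or gradient directions, so it cannot change how the module responds to inputs lying in the common activation subspace, nor move its outputs along the common gradient directions.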

Paper Structure

This paper contains 40 sections, 2 equations, 8 figures, 2 tables, 1 algorithm.

Figures (8)

  • Figure 1: CIR diagram and comparison with prior methods.
  • Figure 2: Success of fine-tuning attacks is determined by disruption during unlearning. We show 50 unlearning runs, each followed by the same fine-tuning attack (details in the appendix). For each run, we mark on the y-axis the WMDP accuracy that was reached with minimal disruption (less than 0.1%), and we continue unlearning past this 0.1% threshold. During the attack, WMDP accuracy is partially restored (see the arrows), but at most to its level at the disruption threshold (shown in red). This means that only the unlearning that happened after the disruption threshold can be reverted, while unlearning that happened without disruption remains robust.
  • Figure 3: Disruption caused by unlearning a simple fact. We show how unlearning "The capital of France is Paris" disrupts the recall of other facts. We measure disruption using the cosine similarity between the model's update on the "Paris" fact and its update on the other evaluated fact. The Activations column shows a slice of the activations entering a middle-layer MLP module at the token position right before the answer. The Gradients column shows a slice of the gradients entering the same module during backpropagation when unlearning the answer (in purple).
  • Figure 4: Comparison of two masking strategies. We show a slice of the updates to a single weight matrix when unlearning "The capital of France is Paris". Weights are colored green when an update successfully unlearns a paraphrased fact ("France's capital is Paris"), red when it disrupts recall of a different fact ("The capital of Spain is Madrid"), and blue for disruption of a control fact ("The capital of Italy is Rome"). We then use the control-fact disruption pattern to identify weights (or rows/columns) that are likely to be disruptive, and filter the unlearning update accordingly. Ideally, we want high unlearning transfer (green) with low disruption (red). Our approach of masking whole columns and rows removes disruption much more accurately.
  • Figure 5: WMDP-Cyber unlearning results. Circuit Breakers exhibit an abrupt unlearning reversal: the retain-loss term undoes earlier gains. A subsequent relearning run from the point of minimum accuracy proves even less robust. We also rerun CIR with a higher allowed disruption of 1% (baselines use 3%, but CIR's high selectivity usually prevents reaching this threshold); consistent with the section on how disruption leads to non-robustness, the additional unlearning gains are minimal. CIR with 0.1% allowed disruption already provides 30× higher unlearning robustness than the baselines.
  • ...and 3 more figures
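
The Figure 3 caption measures disruption as a cosine similarity between model updates. A minimal sketch of that measurement follows, under one plausible reading: each update is a weight-shaped array (for example, a gradient for one fact), and the two arrays are flattened before comparison. The function name `update_cosine` and the convention of returning 0.0 for a zero-norm update are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np


def update_cosine(update_a: np.ndarray, update_b: np.ndarray) -> float:
    """Cosine similarity between two weight updates of the same shape.

    Values near 1 mean the two updates push the weights in the same
    direction, so unlearning one fact would also disturb the other.
    Values near 0 mean the updates are nearly independent.
    """
    a = np.ravel(update_a)
    b = np.ravel(update_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    # A zero-norm update has no direction; report no overlap (assumed convention).
    return float(a @ b / denom) if denom > 0 else 0.0


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    shared = rng.normal(size=(8, 16))        # component common to both facts
    paris = shared + 0.1 * rng.normal(size=(8, 16))
    madrid = shared + 0.1 * rng.normal(size=(8, 16))
    # High similarity here comes from the shared component, which is the
    # kind of overlap that collapsing common representations aims to remove.
    print(update_cosine(paris, madrid))
```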