Beyond Data Filtering: Knowledge Localization for Capability Removal in LLMs

Igor Shilov, Alex Cloud, Aryo Pradipta Gema, Jacob Goldman-Wetzler, Nina Panickssery, Henry Sleight, Erik Jones, Cem Anil

TL;DR

The paper tackles dual-use risks in LLMs by moving beyond data filtering to knowledge localization via Selective GradienT Masking (SGTM). SGTM localizes harmful knowledge into designated forget parameters by masking gradients during training and masking forget parameters during forward passes, enabling post-hoc removal. Across synthetic bilingual and realistic Wikipedia experiments, SGTM consistently outperforms data filtering and prior Gradient Routing variants in forgetting target knowledge while preserving general capabilities, and it shows robustness to labeling noise and adversarial fine-tuning. The findings suggest SGTM as a viable pretraining-time mitigation that reduces leakage and strengthens safety alongside existing safeguards, albeit with a modest compute overhead and considerations for scalability and deployment.

Abstract

Large Language Models increasingly possess capabilities that carry dual-use risks. While data filtering has emerged as a pretraining-time mitigation, it faces significant challenges: labeling whether data is harmful is expensive at scale, and given improving sample efficiency with larger models, even small amounts of mislabeled content could give rise to dangerous capabilities. To address risks associated with mislabeled harmful content, prior work proposed Gradient Routing (Cloud et al., 2024) -- a technique that localizes target knowledge into a dedicated subset of model parameters so they can later be removed. We explore an improved variant of Gradient Routing, which we call Selective GradienT Masking (SGTM), with particular focus on evaluating its robustness to label noise. SGTM zero-masks selected gradients such that target domain examples only update their dedicated parameters. We test SGTM's effectiveness in two applications: removing knowledge of one language from a model trained on a bilingual synthetic dataset, and removing biology knowledge from a model trained on English Wikipedia. In both cases SGTM provides better retain/forget trade-off in the presence of labeling errors compared to both data filtering and a previously proposed instantiation of Gradient Routing. Unlike shallow unlearning approaches that can be quickly undone through fine-tuning, SGTM exhibits strong robustness to adversarial fine-tuning, requiring seven times more fine-tuning steps to reach baseline performance on the forget set compared to a finetuning-based unlearning method (RMU). Our results suggest SGTM provides a promising pretraining-time complement to existing safety mitigations, particularly in settings where label noise is unavoidable.
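The gradient-masking idea described above can be illustrated with a minimal sketch. It assumes a flat parameter vector and a boolean mask marking the forget parameters; the names `sgtm_step`, `remove_forget_params`, and `forget_mask`, the toy values, and the plain SGD update are all hypothetical and are not taken from the paper's implementation. The sketch covers only the gradient-masking half of the method and leaves out the masking of forget parameters during forward passes.

```python
import numpy as np

def sgtm_step(theta, grad, forget_mask, is_forget_example, lr=0.1):
    """One SGD step with selective gradient masking (illustrative only)."""
    if is_forget_example:
        # Zero the gradient on retain parameters so that a forget-domain
        # example only updates the designated forget parameters.
        grad = np.where(forget_mask, grad, 0.0)
    # Assumption: examples not labeled as forget data update all parameters.
    return theta - lr * grad

def remove_forget_params(theta, forget_mask):
    """Post-hoc removal: zero out the designated forget parameters."""
    return np.where(forget_mask, 0.0, theta)

# Toy setup: four parameters, the last two designated as forget parameters.
theta = np.array([1.0, 2.0, 3.0, 4.0])
forget_mask = np.array([False, False, True, True])
grad = np.array([0.5, 0.5, 0.5, 0.5])

after_forget = sgtm_step(theta, grad, forget_mask, is_forget_example=True)
after_retain = sgtm_step(theta, grad, forget_mask, is_forget_example=False)
ablated = remove_forget_params(after_forget, forget_mask)
```

In this toy run the forget-domain step leaves the first two (retain) parameters untouched, and the ablation step then zeroes the last two, which is the sense in which localized knowledge can be removed after training.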

Paper Structure

This paper contains 37 sections, 6 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: SGTM shows a better retain/forget trade-off than data filtering when removing biology knowledge from a model trained on Wikipedia. We compare Selective GradienT Masking (SGTM) with two data filtering baselines: weak (removing only articles classified as Biology) and strict (also removing Medicine & Health, Chemistry, and Earth & Environment articles). The y-axis shows forget loss (Biology), where higher values indicate stronger removal of biology knowledge. The x-axis shows retain loss (non-biology topics), where lower values indicate better general capability retention. Each line represents the progress of one training run, each point a checkpoint at equal intervals. Stars show final checkpoints. Dashed lines show equal compute expenditure in FLOPs (not shown on right). On general knowledge (left) and biology-adjacent knowledge (right), SGTM yields higher forget loss at any given retain loss value. SGTM incurs a compute efficiency penalty, requiring more compute to achieve the same retain loss value.
  • Figure 2: Forget/Retain parameter split in Selective GradienT Masking. In each transformer block we designate a certain number of attention heads and MLP hidden units to the forget data (orange). The remaining parameters are designated to the retain data.
  • Figure 3: SGTM robustly removes forget knowledge, remaining effective even when large fractions of forget data are unlabeled. We report calibrated losses (see the TinyStories experimental setup section) on (a) retain and (b) forget sets when attempting to remove Spanish from a model trained on the bilingual (English/Spanish) TinyStories dataset. We vary the percentage of undiscovered forget data, i.e., the proportion of the forget set not labeled as such. (a) SGTM consistently achieves lower retain loss than the Gradient Routing variant of Cloud et al. (2024) while maintaining higher forget loss, Pareto-dominating prior Gradient Routing across all discovery rates tested. (b) For all non-zero labeling error rates considered, SGTM demonstrates stronger forgetting than both Gradient Routing and data filtering.
  • Figure 4: (a) SGTM shows a better retain/forget trade-off than data filtering when removing the knowledge of a language from a bilingual model (1% artificial mislabeling). We show the trade-off between forget and retain loss on the task of removing Spanish knowledge from the bilingual TinyStories model. We set the rate of undiscovered forget data to 1%. Each line represents the progress of one training run, and each point is a checkpoint at equal intervals of training. Stars show the final checkpoint. Dashed lines show the same proportion of training completed. We compare SGTM with data filtering (removing 99% of the forget data) and Gradient Routing (Cloud et al., 2024). We also show "perfect filter" and "no filter" training as a reference. SGTM provides a better trade-off (higher forget loss at any fixed value of retain loss) than both the 99% filter and Gradient Routing, closely approximating the oracle model represented by perfect filtering. (b) Unlabeled forget data mostly update forget parameters, and unlabeled retain data mostly update retain parameters. Each panel shows kernel density estimates of relative gradient norms ($|\nabla_\theta| / |\theta|$) for different parameter-data combinations. Forget parameters (green) and retain parameters (blue) are evaluated on both forget data (solid) and retain data (dashed) from the test set, with no gradient masking applied. Forget data predominantly updates forget parameters (top-left), while retain data predominantly updates retain parameters (top-right). Conversely, forget parameters receive much stronger updates from forget data (bottom-left). Retain parameters receive updates of similar magnitude from either forget or retain data, with slightly stronger updates from the retain data.
  • Figure 5: (a) Leakage is quantified by comparison with equivalent standard training in which a variable number of forget tokens is added to the data mix. The baseline curve (blue) maps the relationship between forget token exposure and forget loss, established by training models on all retain data with increasing amounts of forget tokens added. Each blue point represents a model trained with the standard training procedure with a given number of forget tokens added to the training dataset. For a given SGTM run (orange) we then take its forget loss and find the number of forget tokens that would achieve the same loss when added to the data mix in standard training (965k). The leakage is then computed by normalizing this number by the total number of (unlabeled) forget tokens in the SGTM run. (b) Leakage decreases with model scale. Values denote the ratio of leaked information (measured in forget token exposure) to total undiscovered forget tokens, ranging between 0 (no leakage) and 1 (all information leaked). Larger models consistently exhibit lower leakage rates, with the 64M model maintaining leakage below 0.02 for up to 40% undiscovered forget data.
  • ...and 8 more figures
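
The leakage computation described in the Figure 5(a) caption can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the baseline points, the SGTM forget loss, and the undiscovered-token count are invented toy values, the function names are hypothetical, and linear interpolation between baseline points is an assumption (the caption does not say how the baseline curve is interpolated).

```python
import numpy as np

def equivalent_forget_tokens(baseline_tokens, baseline_losses, sgtm_forget_loss):
    """Forget-token count at which standard training matches the given loss."""
    tokens = np.asarray(baseline_tokens, dtype=float)
    losses = np.asarray(baseline_losses, dtype=float)
    # np.interp needs increasing x-coordinates, so sort the curve by loss.
    order = np.argsort(losses)
    return float(np.interp(sgtm_forget_loss, losses[order], tokens[order]))

def leakage(equiv_tokens, undiscovered_forget_tokens):
    """Equivalent token exposure normalized by the undiscovered forget tokens."""
    return equiv_tokens / undiscovered_forget_tokens

# Hypothetical baseline curve: forget loss falls as forget tokens are added.
baseline_tokens = [0, 1_000_000, 2_000_000, 4_000_000]
baseline_losses = [5.0, 4.0, 3.5, 3.0]

# Hypothetical SGTM run: forget loss 4.5, 10M undiscovered forget tokens.
equiv = equivalent_forget_tokens(baseline_tokens, baseline_losses, 4.5)
rate = leakage(equiv, 10_000_000)
print(equiv, rate)
```

With these toy numbers, a forget loss of 4.5 sits halfway between the baseline points at 0 and 1M forget tokens, so the equivalent exposure is 500,000 tokens and the leakage is 0.05.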

Theorems & Definitions (1)

  • Definition G.1