Dissecting Language Models: Machine Unlearning via Selective Pruning

Nicholas Pochinkov, Nandi Schoots

TL;DR

This work addresses the challenge of removing specific capabilities from large language models without retraining from scratch. It proposes selective pruning, a neuron-level, post-hoc approach that scores each neuron by how much more important it is to a forget dataset than to a retain dataset, then prunes the highest-scoring neurons iteratively. Across multiple models and tasks, pruning feed-forward neurons achieves targeted forgetting more effectively than pruning attention neurons while largely preserving retained capabilities, and comparisons with other unlearning methods show competitive performance. The findings suggest that capabilities within transformers are modular and separable, and they point to practical paths for efficient, targeted behavior modification in LLMs.

Abstract

Understanding and shaping the behaviour of Large Language Models (LLMs) is increasingly important as applications become more powerful and more frequently adopted. This paper introduces a machine unlearning method specifically designed for LLMs. We introduce a selective pruning method for LLMs that removes neurons based on their relative importance on a targeted capability compared to overall network performance. This approach is a compute- and data-efficient method for identifying and removing neurons that enable specific behaviours. Our findings reveal that both feed-forward and attention neurons in LLMs are specialized; that is, for specific tasks, certain neurons are more crucial than others. Code from all experiments is available at https://github.com/nickypro/selective-pruning
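
The score-then-prune loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the ratio form of the score, the `eps` stabiliser, and the names `score_neurons`, `select_to_prune`, `selective_prune`, and `measure` are all hypothetical. `measure` stands in for whatever routine evaluates per-neuron importance on the forget and retain datasets given the neurons already pruned.

```python
from typing import Callable, List, Set, Tuple

Importances = List[float]  # one importance value per neuron


def score_neurons(forget: Importances, retain: Importances,
                  eps: float = 1e-6) -> List[float]:
    """Assumed scoring rule: forget importance relative to retain importance."""
    return [f / (r + eps) for f, r in zip(forget, retain)]


def select_to_prune(scores: List[float], pruned: Set[int],
                    fraction: float) -> Set[int]:
    """Pick the highest-scoring neurons among those still active."""
    n_prune = max(1, round(len(scores) * fraction))
    active = [i for i in range(len(scores)) if i not in pruned]
    active.sort(key=lambda i: scores[i], reverse=True)
    return set(active[:n_prune])


def selective_prune(
    measure: Callable[[Set[int]], Tuple[Importances, Importances]],
    n_steps: int,
    fraction: float,
) -> Set[int]:
    """Iteratively re-measure importances, score neurons, and prune a fraction."""
    pruned: Set[int] = set()
    for _ in range(n_steps):
        # Hypothetical callback: importances on (forget, retain) data
        # for the model with `pruned` neurons already removed.
        forget, retain = measure(pruned)
        scores = score_neurons(forget, retain)
        pruned |= select_to_prune(scores, pruned, fraction)
    return pruned
```

Re-measuring inside the loop matters because removing neurons can change how important the remaining ones are. A single scoring pass would miss that effect.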

Paper Structure (39 sections, 3 equations, 13 figures, 11 tables, 1 algorithm)

Figures (13)

  • Figure 1: Illustration of selective pruning.
  • Figure 2: We either selectively forget or retain Code ability (Left), Python ability (Middle), or bird recognition ability (Right). For each graph, we show the drop in forget accuracy on the y-axis and the drop in retain accuracy on the x-axis, both measured as Top1 accuracy. We plot a smoothed curve across the 50 pruning steps. For the biggest models, we also plot a dot for every datapoint.
  • Figure 3: Pile vs Code perplexity on various models. We show a smoothed curve over the course of pruning steps and for the biggest models we plot a dot at every pruning step.
  • Figure 4: We evaluate methods for pruning OPT-1.3B (a), Galactica-1.3B (b), Pythia-1.4B (c), and Roberta-355M (d). We use several importance functions (freq, abs, rms, std) on different regions of the model (feed-forward or attention layers). The graphs show the maximal difference between accuracy on Code and accuracy on Pile over 50 pruning steps (each of size 2%).
  • Figure 5: Unnormalized probability density distributions on Pile without GitHub (left), Pile including GitHub (center), and Code (right) for attention pre-out neurons 0-99 in layer 2 of OPT-125M.
  • ...and 8 more figures

Theorems & Definitions (3)

  • Definition 1: Importance Functions
  • Definition 2: Scoring Function
  • Definition 3: Importance of Attention Head
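
The listing names the importance functions (freq, abs, rms, std in Figure 4) without reproducing their definitions. A minimal sketch of one plausible reading follows; every formula here is an assumption inferred from the names rather than taken from Definition 1. Each function takes the activations of a single neuron collected over a dataset. The function names and the `IMPORTANCE_FUNCTIONS` table are hypothetical.

```python
import math
from typing import Callable, Dict, List


def importance_freq(acts: List[float]) -> float:
    """Assumed 'freq': fraction of samples on which the neuron is active (> 0)."""
    return sum(1 for a in acts if a > 0) / len(acts)


def importance_abs(acts: List[float]) -> float:
    """Assumed 'abs': mean absolute activation."""
    return sum(abs(a) for a in acts) / len(acts)


def importance_rms(acts: List[float]) -> float:
    """Assumed 'rms': root mean square of the activations."""
    return math.sqrt(sum(a * a for a in acts) / len(acts))


def importance_std(acts: List[float]) -> float:
    """Assumed 'std': population standard deviation of the activations."""
    mean = sum(acts) / len(acts)
    return math.sqrt(sum((a - mean) ** 2 for a in acts) / len(acts))


# Hypothetical lookup table so a pruning routine can switch between variants.
IMPORTANCE_FUNCTIONS: Dict[str, Callable[[List[float]], float]] = {
    "freq": importance_freq,
    "abs": importance_abs,
    "rms": importance_rms,
    "std": importance_std,
}
```

Under this reading, the four variants differ mainly in how they treat sign and scale: freq ignores magnitude entirely, abs and rms weight large activations more heavily, and std measures spread around the mean rather than distance from zero.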