Dissecting Language Models: Machine Unlearning via Selective Pruning
Nicholas Pochinkov, Nandi Schoots
TL;DR
This work addresses the challenge of removing specific capabilities from large language models (LLMs) without retraining from scratch. It proposes selective pruning, a post-hoc, neuron-level approach that scores each neuron by its importance on a forget dataset relative to a retain dataset, then iteratively prunes the highest-scoring neurons. Across multiple models and tasks, pruning feed-forward neurons achieves targeted forgetting more effectively than pruning attention neurons while largely preserving retained capabilities, and the method performs competitively with other unlearning methods. The findings suggest that capabilities within transformers are modular and separable, and point to practical paths for efficient, targeted behavior modification in LLMs.
Abstract
Understanding and shaping the behaviour of Large Language Models (LLMs) is increasingly important as their applications become more powerful and more widely adopted. This paper introduces a machine unlearning method designed specifically for LLMs: selective pruning, which removes neurons based on their importance to a targeted capability relative to overall network performance. The approach is a compute- and data-efficient way to identify and remove the neurons that enable specific behaviours. Our findings reveal that both feed-forward and attention neurons in LLMs are specialized; that is, for specific tasks, certain neurons are more crucial than others. Code from all experiments is available at https://github.com/nickypro/selective-pruning
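
As a rough illustration of the scoring-and-pruning idea described above, the following minimal sketch scores each neuron by the ratio of its importance on a forget dataset to its importance on a retain dataset, then removes the top-scoring fraction. It is not the authors' implementation. The function names, the use of mean absolute activation as the importance measure, the ratio form of the score, the `frac` and `eps` parameters, and the toy activations are all assumptions made for illustration. In a real model, activations would be re-collected after each pruning step.

```python
from typing import List, Set

def mean_abs_activation(activations: List[List[float]]) -> List[float]:
    """Per-neuron mean absolute activation over a list of samples.

    Assumed importance measure; other statistics could be substituted.
    """
    n_samples = len(activations)
    n_neurons = len(activations[0])
    return [
        sum(abs(sample[j]) for sample in activations) / n_samples
        for j in range(n_neurons)
    ]

def selective_prune_step(
    forget_acts: List[List[float]],
    retain_acts: List[List[float]],
    alive: Set[int],
    frac: float = 0.25,   # hypothetical fraction of live neurons pruned per step
    eps: float = 1e-6,    # avoids division by zero for neurons silent on retain data
) -> Set[int]:
    """Return the set of neurons still alive after one pruning step."""
    forget_imp = mean_abs_activation(forget_acts)
    retain_imp = mean_abs_activation(retain_acts)
    # High score: important for the forget data, unimportant for the retain data.
    scores = {j: forget_imp[j] / (retain_imp[j] + eps) for j in alive}
    n_prune = max(1, int(len(alive) * frac))
    to_prune = sorted(scores, key=scores.get, reverse=True)[:n_prune]
    return alive - set(to_prune)

# Toy activations (rows are samples, columns are neurons); purely illustrative.
forget_acts = [[0.1, 2.0, 0.5, 0.0], [0.2, 1.8, 0.4, 0.1]]
retain_acts = [[0.1, 0.2, 0.5, 1.0], [0.2, 0.1, 0.6, 0.9]]

alive = {0, 1, 2, 3}
alive = selective_prune_step(forget_acts, retain_acts, alive)
print(sorted(alive))  # neuron 1 dominates the forget data, so it is removed first
```

Calling `selective_prune_step` repeatedly, with fresh activations each time, gives the iterative variant. Neurons that matter mostly to the retain data (neuron 3 in the toy example) receive low scores and are kept.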
