Large Language Models Relearn Removed Concepts

Michelle Lo, Shay B. Cohen, Fazl Barez

TL;DR

This study investigates whether large language models can relearn concepts after pruning crucial neurons. By identifying top concept neurons with a probeless search, pruning them, and retraining, the authors track how concept saliency and similarity evolve, revealing rapid redistribution of pruned concepts to earlier layers and among primed neurons. They find that neurons often become polysemantic, relearning a blend of old and new concepts, which challenges the feasibility of permanent concept removal for safety. The findings have implications for model editing, robustness, and interpretability, emphasizing the need for monitoring concept reemergence and developing mitigation strategies.

Abstract

Advances in model editing through neuron pruning hold promise for removing undesirable concepts from large language models. However, it remains unclear whether models have the capacity to reacquire pruned concepts after editing. To investigate this, we evaluate concept relearning in models by tracking concept saliency and similarity in pruned neurons during retraining. Our findings reveal that models can quickly regain performance post-pruning by relocating advanced concepts to earlier layers and reallocating pruned concepts to primed neurons with similar semantics. This demonstrates that models exhibit polysemantic capacities and can blend old and new concepts in individual neurons. While neuron pruning provides interpretability into model concepts, our results highlight the challenges of permanent concept removal for improved model safety. Monitoring concept reemergence and developing techniques to mitigate relearning of unsafe concepts will be important directions for more robust model editing. Overall, our work strongly demonstrates the resilience and fluidity of concept representations in LLMs post concept removal.
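To make the identify-then-prune step in the abstract concrete, here is a minimal sketch in Python. It is not the authors' implementation: it assumes that a neuron's concept saliency can be approximated by the absolute difference between its mean activation on concept tokens and on all other tokens, and that pruning means zeroing the weights and bias that produce the neuron. The function names (`concept_saliency`, `top_concept_neurons`, `prune_neurons`, `mean_saliency_by_layer`) and the array layout are hypothetical, chosen only for illustration.

```python
import numpy as np

def concept_saliency(acts, is_concept):
    """Assumed saliency score: |mean activation on concept tokens
    - mean activation on all other tokens|, computed per neuron.

    acts: array of shape (n_tokens, n_neurons)
    is_concept: boolean array of shape (n_tokens,)
    """
    acts = np.asarray(acts, dtype=float)
    is_concept = np.asarray(is_concept, dtype=bool)
    return np.abs(acts[is_concept].mean(axis=0) - acts[~is_concept].mean(axis=0))

def top_concept_neurons(saliency, k):
    """Indices of the k most salient neurons, most salient first."""
    return np.argsort(saliency)[::-1][:k]

def prune_neurons(weight, bias, neuron_ids):
    """Return copies of (weight, bias) with the selected neurons zeroed.

    weight: array of shape (n_neurons, n_inputs); row i produces neuron i.
    bias: array of shape (n_neurons,)
    """
    weight, bias = weight.copy(), bias.copy()
    weight[neuron_ids, :] = 0.0
    bias[neuron_ids] = 0.0
    return weight, bias

def mean_saliency_by_layer(saliency, layer_ids):
    """Mean saliency per layer, given the layer index of each neuron."""
    saliency = np.asarray(saliency, dtype=float)
    layer_ids = np.asarray(layer_ids)
    return {int(layer): float(saliency[layer_ids == layer].mean())
            for layer in np.unique(layer_ids)}
```

In a retraining loop, one would recompute `concept_saliency` on fresh activations after each epoch and compare `mean_saliency_by_layer` across epochs to see whether the pruned concept reappears, and in which layers.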

Paper Structure

This paper contains 35 sections, 1 equation, 15 figures, 21 tables, 1 algorithm.

Figures (15)

  • Figure 1: Process of investigating neuroplasticity in a large language model. We identify concept neurons (dark blue) in the base model, and prune them (white). We then retrain the model until it regains its original performance and identify new concept neurons.
  • Figure 2: Mean concept saliency for the concept of location names, for neurons across different layers of a baseline DistilGPT2 model throughout the process of neuroplasticity, after pruning random neurons.
  • Figure 3: Mean concept saliency for the concept of location names, for neurons across different layers of DistilBERT throughout the process of neuroplasticity. The most salient neurons are in layers 5 and 6, but earlier layers demonstrate higher saliency scores relative to other layers than before.
  • Figure 4: Mean concept saliency for the concept of location names, for neurons across different layers of DistilGPT2 throughout the process of neuroplasticity. Saliency increases for middle and later layers (layers 5 and 6).
  • Figure 5: Mean concept saliency for the concept of location names, for neurons across different layers of GPT2 throughout neuroplasticity. Results are similar to those for DistilGPT2.
  • ...and 10 more figures