Split, Unlearn, Merge: Leveraging Data Attributes for More Effective Unlearning in LLMs

Swanand Ravindra Kadhe, Farhan Ahmed, Dennis Wei, Nathalie Baracaldo, Inkit Padhi

TL;DR

This paper introduces SPUNGE, a data-attribute aware framework that enhances unlearning in large language models by splitting unlearning data according to meaningful attributes, unlearning each subset independently, and merging the results. It demonstrates that SPUNGE can boost the effectiveness of existing unlearning methods (TVN and RMU) in two safety-critical scenarios: toxicity/hate speech and hazardous knowledge (biosecurity and cybersecurity), while preserving performance on standard benchmarks. Empirical results show substantial reductions in toxic outputs and hazardous knowledge recall, with modest or negligible losses in general capabilities. The approach is modular and broadly applicable, providing a practical route to safer LLMs with improved efficiency compared to monolithic unlearning. Future directions include applying SPUNGE to broader data-unlearning tasks, such as copyrighted content.

Abstract

Large language models (LLMs) have been shown to pose social and ethical risks such as generating toxic language or facilitating malicious use of hazardous knowledge. Machine unlearning is a promising approach to improve LLM safety by directly removing harmful behaviors and knowledge. In this paper, we propose "SPlit, UNlearn, MerGE" (SPUNGE), a framework that can be used with any unlearning method to amplify its effectiveness. SPUNGE leverages data attributes during unlearning by splitting unlearning data into subsets based on specific attribute values, unlearning each subset separately, and merging the unlearned models. We empirically demonstrate that SPUNGE significantly improves the performance of two recent unlearning methods on state-of-the-art LLMs while maintaining their general capabilities on standard academic benchmarks.
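The three stages described in the abstract can be sketched in a few lines of Python. This is only an illustrative toy, not the authors' implementation: model weights are stood in for by plain dictionaries of floats, `toy_unlearn` is a hypothetical placeholder for a real unlearning method such as TVN or RMU, and the merge step uses simple parameter averaging, which is an assumption since the text above does not specify the merging rule. The names `split_by_attribute`, `merge_by_averaging`, and `spunge` are likewise invented for this sketch.

```python
from collections import defaultdict
from typing import Callable, Dict, Hashable, List, Sequence

# Toy stand-in for model parameters (assumption: real models would use tensors).
Weights = Dict[str, List[float]]


def split_by_attribute(samples: Sequence[dict], attribute: str) -> Dict[Hashable, List[dict]]:
    """Split: group unlearning samples by the value of one attribute."""
    subsets = defaultdict(list)
    for sample in samples:
        subsets[sample[attribute]].append(sample)
    return dict(subsets)


def merge_by_averaging(models: Sequence[Weights]) -> Weights:
    """Merge: element-wise parameter average (an assumed merge rule)."""
    merged = {}
    for name in models[0]:
        columns = zip(*(model[name] for model in models))
        merged[name] = [sum(column) / len(models) for column in columns]
    return merged


def spunge(
    base: Weights,
    samples: Sequence[dict],
    attribute: str,
    unlearn: Callable[[Weights, List[dict]], Weights],
) -> Weights:
    """Split the data, unlearn each subset from the same base model, then merge."""
    subsets = split_by_attribute(samples, attribute)
    unlearned = [unlearn(base, subset) for subset in subsets.values()]
    return merge_by_averaging(unlearned)


def toy_unlearn(base: Weights, subset: List[dict]) -> Weights:
    """Hypothetical placeholder: shift every weight by the subset size."""
    return {name: [w - len(subset) for w in values] for name, values in base.items()}


# Toy usage: three samples split over two attribute values.
base_model = {"w": [1.0, 2.0]}
data = [
    {"text": "x1", "group": "a"},
    {"text": "x2", "group": "a"},
    {"text": "x3", "group": "b"},
]
merged = spunge(base_model, data, attribute="group", unlearn=toy_unlearn)
```

Because each subset is unlearned independently from the same base model, the per-subset calls in `spunge` could run in parallel, and the unlearning method can be swapped without touching the split or merge steps.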

Paper Structure (18 sections, 4 figures, 7 tables, 1 algorithm)

Figures (4)

  • Figure 1: An overview of the SPlit, UNlearn, then merGE (SPUNGE) framework. SPUNGE splits the unlearning dataset into subsets based on selected attribute values, unlearns each subset separately, and then merges the unlearned models.
  • Figure 2: Toxicity scores per demographic group on the ToxiGen test set for the Llama2-7b base model, after unlearning with TVN, and after unlearning with SPUNGE used with TVN.
  • Figure 3: Toxicity scores per demographic group on the ToxiGen test set for the Zephyr-7b-beta base model, after unlearning with RMU, and after unlearning with SPUNGE used with RMU.
  • Figure 4: Toxicity scores per demographic group on the ToxiGen test set for the Zephyr-7b-beta base model, after unlearning with TVN, and after unlearning with SPUNGE used with TVN.