CodeUnlearn: Amortized Zero-Shot Machine Unlearning in Language Models Using Discrete Concept

YuXuan Wu, Bonaventure F. P. Dossou, Dianbo Liu

TL;DR

This work proposes a novel amortized unlearning approach that uses codebook features and Sparse Autoencoders (SAEs) to unlearn specific topics with contextual relevance in an LLM, marking a significant step towards real-world applications of machine unlearning.

Abstract

Large Language Models (LLMs) offer extensive knowledge across various domains, but they may inadvertently memorize sensitive, unauthorized, or malicious data, such as personal information in the medical and financial sectors. Machine unlearning methods aim to remove specific information from models after training to address this. However, current approaches require additional model training or struggle to effectively erase particular data points and their associated context due to LLMs' complex, dense, and continuous nature. In this study, we propose a novel amortized unlearning approach using codebook features and Sparse Autoencoders (SAEs). By leveraging a bottleneck to decompose the activation space and regulate information flow, our method efficiently unlearns targeted information while preserving the model's performance on unrelated data. To the best of our knowledge, this is the first work that successfully enables unlearning specific topics with contextual relevance in an LLM, marking a significant step towards real-world applications of machine unlearning.
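
The bottleneck idea in the abstract can be illustrated with a minimal sketch. It assumes a top-1 nearest-neighbour lookup by Euclidean distance and a boolean mask marking removed codes; the function name `quantize`, the mask `active`, and the toy vectors are hypothetical and are not taken from the paper, whose actual lookup rule (for example top-k or cosine similarity) may differ.

```python
import numpy as np

def quantize(activations, codebook, active):
    """Replace each activation with its nearest *active* codebook vector.

    activations: (n, d) array of continuous activations
    codebook:    (m, d) array of discrete code vectors
    active:      (m,) boolean mask; False marks a code that was removed
    """
    # Squared Euclidean distance between every activation and every code.
    diffs = activations[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Removed codes get infinite distance so they can never be selected.
    dists = np.where(active[None, :], dists, np.inf)
    codes = np.argmin(dists, axis=1)
    return codes, codebook[codes]

# Toy codebook with three 2-d codes (illustrative values only).
codebook = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
active = np.ones(len(codebook), dtype=bool)
acts = np.array([[0.9, 0.1], [0.1, 0.8]])

codes_before, _ = quantize(acts, codebook, active)

# "Unlearning" in this sketch: deactivate code 0 without any retraining.
active[0] = False
codes_after, quantized_after = quantize(acts, codebook, active)
```

Because every downstream layer sees only codebook vectors, deactivating a code changes what information can pass through the bottleneck without touching any other weights, which is the sense in which the sketch is zero-shot.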

Paper Structure

This paper contains 36 sections, 14 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: CodeUnlearn—Our Amortized Zero-Shot Machine Unlearning for Language Models. Left: Discrete latent bottlenecking in the transformer architecture. After applying the residual connection, the multi-head attention output is discretized using a discrete embedding vocabulary, referred to as the codebook. This approach prevents information leakage via the residual connection, ensuring that the codebook effectively regulates and interprets the network's behavior. Right: Zero-shot machine unlearning is achieved by removing the discrete codes in the codebook that correspond to the targeted information.
  • Figure 2: Unlearning a Target Topic in a Language Model. The zero-shot unlearning process begins by identifying codes enriched in data subsets with the target topic ($D_T$) as opposed to the subset without it ($D_{\tilde{T}}$). Codes with p-values less than 0.05 are removed from the codebook. After this removal, the model exhibits significantly decreased performance on target information inputs.
  • Figure 3: Performance Drop after Unlearning on the Topic 'Love'. The X-axis shows the model variations, with the first column as the original model. Columns 2 to 8 represent increasing levels of unlearning, with the number indicating the top $S$ codes used and removed. The Y-axis represents the percentage change in various metrics compared to the original model. As more codes are deleted, the model's performance on the target topic declines rapidly, while performance on non-topic content remains more stable.
  • Figure 4: Performance Drop after Unlearning on the Topic 'Julien'. Similar to the 'love' topic, we tested the unlearning procedure on the name 'Julien'.
  • Figure 5: Metrics after Unlearning the Topic 'Love' and Testing on 'Like'. The model unlearned the 'love' topic, but its performance on the 'like' topic also deteriorated, which suggests that the unlearning procedure removes not only the specific target information but also the relevant context.
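
The code-selection step described in the Figure 2 caption (keep codes enriched in $D_T$ relative to $D_{\tilde{T}}$, then remove those with p-values below 0.05) can be sketched as follows. The caption does not name the statistical test, so this example assumes a one-sided hypergeometric tail test and treats code activations as independent draws; the helper names and the activation counts are hypothetical.

```python
from math import comb

def hypergeom_tail(k, K, n, N):
    """P(X >= k) when drawing n items from N, of which K are 'successes'."""
    upper = min(K, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / comb(N, n)

def enriched_codes(target_counts, other_counts, alpha=0.05):
    """Return (code, p_value) pairs for codes over-represented in the target set.

    target_counts / other_counts map a code index to how often it fired
    on inputs with / without the target topic.
    """
    n_target = sum(target_counts.values())
    n_total = n_target + sum(other_counts.values())
    selected = []
    for code in sorted(set(target_counts) | set(other_counts)):
        k = target_counts.get(code, 0)          # activations on target data
        K = k + other_counts.get(code, 0)       # activations overall
        p = hypergeom_tail(k, K, n_target, n_total)
        if p < alpha:                           # threshold from the caption
            selected.append((code, p))
    return selected

# Hypothetical activation counts for three codes.
target_counts = {7: 9, 12: 5, 30: 6}
other_counts = {7: 1, 12: 5, 30: 14}

selected = enriched_codes(target_counts, other_counts)
codes_to_remove = [code for code, _ in selected]
```

In this toy data only code 7 fires far more often on the target subset than chance would predict, so it is the only candidate for removal; codes that fire equally often on both subsets are left in the codebook.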