Deep Contrastive Unlearning for Language Models

Estrid He, Tabinda Sarwar, Ibrahim Khalil, Xun Yi, Ke Wang

TL;DR

This paper tackles the privacy and copyright risks of large language models through machine unlearning that operates in the latent space. It introduces DeepCUT, a latent-space contrastive unlearning framework that uses a FoCL-inspired forgetting loss to push forgotten samples away from same-class neighbors and toward other-class samples, while preserving retained knowledge via a standard cross-entropy term. The final objective combines the cross-entropy term $\mathcal{L}_{CE}$ and the forgetting term $\mathcal{L}_{f}$ as $\mathcal{L} = \mathcal{L}_{CE} + \gamma \mathcal{L}_{f}$, where $\gamma$ weights the forgetting term, and it uses dropout-based data augmentation to create multiple latent views for contrastive learning. Evaluations on four NLP datasets show that DeepCUT achieves superior forgetting effectiveness with minimal degradation in predictive performance, and that it does so more efficiently than retraining or SISA baselines. These results demonstrate the value of explicit latent-space manipulation for principled, efficient unlearning in LLMs.
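
The combined objective can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the InfoNCE-style form of the forgetting term, the temperature `tau`, and all function names are assumptions made for illustration, and the dropout-based views are left out for brevity. It only shows the direction of the pull (toward other-class embeddings, away from same-class ones) and how $\gamma$ weights the two terms.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def cross_entropy(logits, label):
    # Numerically stable negative log-softmax of the true class.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]

def forgetting_loss(anchor, same_class, other_class, tau=0.5):
    # ASSUMPTION: an InfoNCE-style term in which, for a forgotten sample,
    # other-class embeddings act as positives and same-class embeddings
    # act as negatives. The paper's exact formula may differ.
    a = normalize(anchor)
    pos = [math.exp(dot(a, normalize(z)) / tau) for z in other_class]
    neg = [math.exp(dot(a, normalize(z)) / tau) for z in same_class]
    denom = sum(pos) + sum(neg)
    return -sum(math.log(p / denom) for p in pos) / len(pos)

def total_loss(ce, lf, gamma=1.0):
    # L = L_CE + gamma * L_f, as stated in the summary above.
    return ce + gamma * lf

# Hypothetical 2-D latent embeddings, for illustration only.
same_class = [[1.0, 0.1]]    # retained neighbor sharing the forgotten label
other_class = [[0.0, 1.0]]   # retained sample from a different class
before = forgetting_loss([1.0, 0.0], same_class, other_class)
after = forgetting_loss([0.0, 1.0], same_class, other_class)
```

In this toy setup, `before` (forgotten sample still sitting next to its same-class neighbor) is larger than `after` (sample moved next to the other class), so minimizing the term moves the forgotten embedding in the direction the summary describes.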

Abstract

The past few years have witnessed the great success of large language models, which demonstrate powerful capabilities in comprehending textual data and generating human-like language. Large language models achieve this success by being trained on vast amounts of textual data, including online sources with copyrighted content and user-generated knowledge. However, this comes at a cost: the potential risk of exposing users' privacy and violating copyright protections. Thus, to safeguard individuals' "right to be forgotten", there has been increasing interest in machine unlearning -- the process of removing the information carried by particular training samples from a model without deteriorating its predictive quality. This is a challenging task due to the black-box nature of language models. Most existing studies focus on mitigating the impact of the forgotten samples on a model's outputs, and do not explicitly consider the geometric distribution of samples in the model's latent space. To address this issue, we propose a machine unlearning framework, named Deep Contrastive Unlearning for fine-Tuning (DeepCUT) language models. Our proposed model achieves machine unlearning by directly optimizing the latent space of a model. Comprehensive experiments on real-world datasets demonstrate the effectiveness and efficiency of DeepCUT, with consistent and significant improvement over baseline methods.

Paper Structure

This paper contains 19 sections, 6 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Overview of machine unlearning. A foundation model is trained and fine-tuned on data contributed by web users. When an edge user sends a data removal request, the originally trained foundation model is converted into a new model with updated weights, as if the new model had never seen the data to be forgotten.
  • Figure 2: Overview of the proposed DeepCUT framework. Red: latent embeddings of data samples that are requested to be removed from model $M$. Yellow/Green/Blue: latent embeddings of data samples that should be preserved in $M$.
  • Figure 3: Comparison of methods in terms of running time on all datasets when unlearning 10% of the original training data.