CLIPErase: Efficient Unlearning of Visual-Textual Associations in CLIP

Tianyu Yang, Lisen Dai, Xiangqi Wang, Minhao Cheng, Yapeng Tian, Xiangliang Zhang

TL;DR

This work introduces CLIPErase, a three-module framework (Forgetting, Retention, Consistency) for targeted unlearning in pretrained CLIP that removes specific visual-textual associations without full retraining. It jointly optimizes a forgetting objective on the forget set, a retention objective on the retain set, and a consistency regularizer that keeps unimodal representations aligned with the original model. CLIPErase achieves near-zero forget-set accuracy while preserving high performance on retained data across zero-shot, retrieval, and diffusion-generation tasks. Experiments on CIFAR-100, Conceptual 12M, and Flickr30K demonstrate precise, scalable forgetting, generalization to other VLMs such as BLIP, and integration with diffusion models for controlled image generation. The results suggest practical potential for privacy, intellectual property protection, and bias mitigation in multimodal learning; the authors also note the need for dedicated MU benchmarks and future extensions to broader generative models.

Abstract

Machine unlearning (MU) has gained significant attention as a means to remove specific data from trained models without requiring a full retraining process. While progress has been made in unimodal domains like text and image classification, unlearning in multimodal models remains relatively underexplored. In this work, we address the unique challenges of unlearning in CLIP, a prominent multimodal model that aligns visual and textual representations. We introduce CLIPErase, a novel approach that disentangles and selectively forgets both visual and textual associations, ensuring that unlearning does not compromise model performance. CLIPErase consists of three key modules: a Forgetting Module that disrupts the associations in the forget set, a Retention Module that preserves performance on the retain set, and a Consistency Module that maintains consistency with the original model. Extensive experiments on the CIFAR-100 and Flickr30K datasets across four CLIP downstream tasks demonstrate that CLIPErase effectively forgets designated associations in zero-shot tasks for multimodal samples, while preserving the model's performance on the retain set after unlearning.
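
To make the three-module structure concrete, the following is a minimal sketch of how such an objective could be assembled from precomputed image and text embeddings. It is not the authors' implementation: the specific loss forms (mean cosine similarity for forgetting, a symmetric contrastive loss for retention, mean squared distance to the original model's embeddings for consistency), the choice to compute the consistency term on retain-set embeddings, the weights `w_forget`, `w_retain`, `w_consist`, and all function names are assumptions chosen for illustration.

```python
import numpy as np

def l2_normalize(x):
    """Scale each row to unit length so dot products are cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def forgetting_loss(img_f, txt_f):
    """Assumed form: mean cosine similarity of matched forget-set pairs.
    Minimizing it pushes paired image and text embeddings apart."""
    return float(np.mean(np.sum(l2_normalize(img_f) * l2_normalize(txt_f), axis=1)))

def retention_loss(img_r, txt_r, temperature=0.07):
    """Assumed form: symmetric CLIP-style contrastive loss on retain-set pairs."""
    logits = l2_normalize(img_r) @ l2_normalize(txt_r).T / temperature

    def cross_entropy_diag(z):
        z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))   # matched pairs sit on the diagonal

    return float(0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T)))

def consistency_loss(img_new, txt_new, img_orig, txt_orig):
    """Assumed form: mean squared distance between the unlearned model's
    unimodal embeddings and those of the frozen original model."""
    return float(np.mean((img_new - img_orig) ** 2) + np.mean((txt_new - txt_orig) ** 2))

def combined_objective(img_f, txt_f, img_r, txt_r, img_r_orig, txt_r_orig,
                       w_forget=1.0, w_retain=1.0, w_consist=1.0):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_forget * forgetting_loss(img_f, txt_f)
            + w_retain * retention_loss(img_r, txt_r)
            + w_consist * consistency_loss(img_r, txt_r, img_r_orig, txt_r_orig))
```

In a real training loop the embeddings would come from the CLIP encoders being fine-tuned, with the original encoders kept frozen to supply `img_r_orig` and `txt_r_orig`, and the objective would be minimized with an autodiff framework rather than evaluated in NumPy.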

Paper Structure

This paper contains 30 sections, 5 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Comparison of Stable Diffusion results using the original CLIP, a unimodally unlearned CLIP (Gradient Ascent on the text modality), and our CLIPErase. Unimodal unlearning introduces distortions and fails to remove the targeted concepts, whereas CLIPErase selectively erases them while preserving other details.
  • Figure 2: Overview of the CLIPErase framework, consisting of three key modules: (a) Forgetting Module: disrupts cross-modal associations within the forget set to weaken the undesired image and text associations; (b) Retention Module: preserves cross-modal associations within the retain set; (c) Consistency Module: maintains consistency with the original model by aligning unimodal representations.
  • Figure 3: Performance across different numbers of Forget Set classes.
  • Figure 4: Comparison of image generation results using the original CLIP and our CLIPErase model in Stable Diffusion with multi-concept prompts. The prompt represents the input to the diffusion model. Blue text denotes concepts unlearned by CLIPErase, while red text highlights concepts that should be retained.
  • Figure 5: Attention Heatmaps before unlearning (CLIP) and after unlearning (CLIPErase) on apple images.
  • ...and 2 more figures