Defending Our Privacy With Backdoors

Dominik Hintersdorf, Lukas Struppek, Daniel Neider, Kristian Kersting

TL;DR

This work addresses privacy risks in vision-language models trained on uncurated data by introducing a novel backdoor-based unlearning defense. It leverages a teacher–student fine-tuning setup and a crafted loss to inject backdoors that remap sensitive inputs to neutral embeddings in both text and image encoders, thereby weakening the association between identities and their appearances. The approach is demonstrated to effectively defeat Identity Inference Attacks (IDIA) on CLIP with minimal utility loss, and it scales to downstream tasks such as Stable Diffusion, offering a practical, fast alternative to full retraining. While providing strong privacy protection, the paper discusses limitations (e.g., lack of formal guarantees, potential synonym bypass) and highlights the dual-use nature of backdoors for defense against privacy attacks.

Abstract

The proliferation of large AI models trained on uncurated, often sensitive web-scraped data has raised significant privacy concerns. One of the concerns is that adversaries can extract information about the training data using privacy attacks. Unfortunately, removing specific information from the models without sacrificing performance is not straightforward and has proven to be challenging. We propose a rather easy yet effective defense based on backdoor attacks to remove private information, such as names and faces of individuals, from vision-language models by fine-tuning them for only a few minutes instead of re-training them from scratch. Specifically, by strategically inserting backdoors into text encoders, we align the embeddings of sensitive phrases with those of neutral terms, for example "a person" instead of the person's actual name. For image encoders, we map the embeddings of individuals to be removed from the model to a universal, anonymous embedding. The results of our extensive experimental evaluation demonstrate the effectiveness of our backdoor-based defense on CLIP by assessing its performance using a specialized privacy attack for zero-shot classifiers. Our approach provides a new "dual-use" perspective on backdoor attacks and presents a promising avenue to enhance the privacy of individuals within models trained on uncurated web-scraped data.
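The teacher-student idea described above can be illustrated with a minimal sketch. This section does not give the paper's exact loss, so everything below is an assumption made for illustration: the function names, the use of negative cosine similarity, the weighting factor `beta`, and the toy 2-D embeddings. The sketch only shows the general shape of such an objective: one term pulls the student's embedding of a name-containing input toward the frozen teacher's embedding of the neutral input, and a second term keeps the student close to the teacher on unrelated inputs.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def backdoor_unlearning_loss(student_name, teacher_neutral,
                             student_clean, teacher_clean, beta=1.0):
    """Hypothetical two-term objective (not the paper's exact formula).

    student_name:    student embedding of an input containing the name
    teacher_neutral: teacher embedding of the same input with the name
                     replaced by a neutral term such as "a person"
    student_clean:   student embedding of an unrelated input
    teacher_clean:   teacher embedding of that same unrelated input
    beta:            assumed weight of the utility term
    """
    # Backdoor term: the name acts as the trigger and is remapped
    # to the neutral target embedding.
    backdoor_term = -cosine_similarity(student_name, teacher_neutral)
    # Utility term: behaviour on inputs without the trigger is preserved.
    utility_term = -cosine_similarity(student_clean, teacher_clean)
    return backdoor_term + beta * utility_term


# Toy 2-D embeddings, made up purely for illustration.
teacher_neutral = [1.0, 0.0]
teacher_clean = [0.6, 0.8]

# Before fine-tuning: the name embedding is unrelated to the neutral target.
loss_before = backdoor_unlearning_loss(
    [0.0, 1.0], teacher_neutral, teacher_clean, teacher_clean)
# After fine-tuning: the name embedding points in the neutral direction.
loss_after = backdoor_unlearning_loss(
    [2.0, 0.0], teacher_neutral, teacher_clean, teacher_clean)

print(loss_before, loss_after)
```

In this toy setting the loss is lower once the name embedding aligns with the neutral target while the clean embedding stays unchanged, which is the behaviour the defense aims for. In practice the embeddings would come from the CLIP encoders and the loss would be minimized with a gradient-based optimizer.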

Paper Structure (24 sections, 4 equations, 19 figures, 9 tables)

Figures (19)

  • Figure 1: Backdoors can be used to remap embeddings for unlearning. Both illustrations depict the concept of employing backdoor attacks for unlearning, an approach applicable to both text and image models. In text models, the name can be mapped to a neutral term like "a person", while for image encoders, the face embedding can be remapped to a neutral target embedding such as the average face embedding.
  • Figure 2: Using backdoors successfully removes names of individuals from the text encoder of the ViT-B/32 CLIP model while maintaining its utility. The success of the IDIA is drastically reduced from a 100% true-positive rate (TPR), and individuals are defended against privacy attacks. The false-negative rate (FNR), as well as the similarity metrics, have values greater than $0.99$. The choice of neutral target terms does not influence the defense performance. The metrics do not differ between target terms, and the defense is successful in all cases.
  • Figure 3: Using backdoors successfully removes faces of individuals from the image encoder of the ViT-B/32 CLIP model while maintaining its utility. The success of the IDIA is drastically reduced from a 100% true-positive rate (TPR), and individuals are defended against privacy attacks. In comparison to defending the text encoder, unlearning the faces of multiple identities at the same time seems to be harder. However, weight regularization seems to successfully mitigate the decrease in performance.
  • Figure 4: Applying our defense to the text encoder of Stable Diffusion, we can remove Adam Sandler from the model. Each of the two examples shows the image generated by the original Stable Diffusion model with the prompt containing the name (left), the image generated by the defended model (middle), and the image generated by the original model with the prompt containing "person" instead of the name (right). The exact prompt used for generating the images and additional examples can be found in the appendix on the Stable Diffusion experiments.
  • Figure 5: Run-time analysis of the defense applied to the image and text encoders. The run time in seconds is plotted against the number of removed names/faces. We measured the run time three times for each number of removed names and faces and fitted a linear regression to estimate the time added by each additional unlearned name/face. For the text encoder, each additional name adds approximately 0.07 seconds to the run time, while each additional unlearned face adds 1.55 seconds on average.
  • ...and 14 more figures
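
The linear fit described in the Figure 5 caption can be sketched as an ordinary least-squares regression of run time on the number of removed names. The timings below are synthetic placeholders, not measurements from the paper: they are generated from the reported slope of 0.07 seconds per name and a made-up base time (`HYPOTHETICAL_BASE_SECONDS`), only to show how the per-name cost is read off as the slope. The helper `fit_line` is likewise an illustrative name.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Assumption: a made-up fixed cost; the caption does not report an intercept.
HYPOTHETICAL_BASE_SECONDS = 30.0
# Per-name cost for the text encoder as reported in the Figure 5 caption.
SECONDS_PER_NAME = 0.07

# Synthetic, noise-free timings standing in for real measurements.
counts = [1, 2, 4, 8, 16, 32, 64]
runtimes = [HYPOTHETICAL_BASE_SECONDS + SECONDS_PER_NAME * c for c in counts]

slope, intercept = fit_line(counts, runtimes)
print(f"seconds per additional name: {slope:.2f}")
```

With real measurements, each count would appear three times (one entry per repeated run), and the fitted slope would be the estimated cost of each additional unlearned name or face.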