
DEPN: Detecting and Editing Privacy Neurons in Pretrained Language Models

Xinwei Wu, Junzhuo Li, Minghui Xu, Weilong Dong, Shuangzhi Wu, Chao Bian, Deyi Xiong

TL;DR

DEPN introduces a gradient-attribution–driven privacy neuron detector and a simple activation-zeroing editor to dememorize privacy information in pretrained language models. The method enables efficient post-processing protection against privacy leakage, using a privacy neuron aggregator to batch-process multiple sentences. Experiments on Enron data show reduced privacy leakage with minimal impact on task performance, and analyses reveal how memorization concentrates privacy neurons in deeper layers and grows with training time and model size. The work advances privacy-preserving model editing by localizing and removing privacy-related components without retraining.

Abstract

Large language models pretrained on a huge amount of data capture rich knowledge and information from that data. The ability of pretrained language models to memorize and regurgitate training data, revealed in previous studies, brings the risk of data leakage. To reduce this risk effectively, we propose DEPN, a framework to Detect and Edit Privacy Neurons in pretrained language models, partially inspired by knowledge neurons and model editing. In DEPN, we introduce a novel method, termed the privacy neuron detector, to locate neurons associated with private information, and then edit these detected privacy neurons by setting their activations to zero. Furthermore, we propose a privacy neuron aggregator to dememorize private information in a batch-processing manner. Experimental results show that our method can significantly and efficiently reduce the exposure of private data without deteriorating the performance of the model. Additionally, we empirically demonstrate the relationship between model memorization and privacy neurons from multiple perspectives, including model size, training time, prompts, and privacy neuron distribution, illustrating the robustness of our approach.
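
The detect-then-edit loop described in the abstract can be illustrated on a toy model. The sketch below is not the authors' implementation. The single-layer sigmoid "model", the function names, and the choice of a Riemann-sum integrated-gradients score (in the style of the knowledge-neurons work that the abstract names as inspiration) are all assumptions made for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def token_prob(acts, weights):
    """Toy stand-in for P(private token | context): a sigmoid over a
    weighted sum of hidden activations. Hypothetical, not a real LM."""
    return sigmoid(sum(w * a for w, a in zip(weights, acts)))

def attribution_scores(acts, weights, steps=50):
    """Assumed scoring rule: scale one activation from 0 up to its actual
    value, accumulate the gradient of the token probability along the way,
    and multiply by the activation (Riemann-sum integrated gradients)."""
    scores = []
    for i, a_i in enumerate(acts):
        total = 0.0
        for k in range(1, steps + 1):
            scaled = list(acts)
            scaled[i] = a_i * k / steps
            p = token_prob(scaled, weights)
            # Analytic gradient d p / d act_i for this toy sigmoid model.
            total += p * (1.0 - p) * weights[i]
        scores.append(a_i * total / steps)
    return scores

def top_z(scores, z):
    """Indices of the z neurons with the highest attribution scores."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:z]

def zero_neurons(acts, neuron_ids):
    """The edit step: set the activations of the selected neurons to zero."""
    ids = set(neuron_ids)
    return [0.0 if i in ids else a for i, a in enumerate(acts)]

# Hypothetical activations and output weights for four neurons.
acts = [2.0, 0.5, 1.0, 0.1]
weights = [1.5, -0.3, 0.2, 0.05]

scores = attribution_scores(acts, weights)
privacy_neurons = top_z(scores, z=1)
edited = zero_neurons(acts, privacy_neurons)

before = token_prob(acts, weights)
after = token_prob(edited, weights)
print(privacy_neurons, round(before, 3), round(after, 3))
```

In this toy setting, the neuron that contributes most to the private token's probability receives the highest score, and zeroing it lowers that probability while leaving the other activations untouched. In a real pretrained model the gradients would come from automatic differentiation rather than a closed-form expression.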

Paper Structure

This paper contains 30 sections, 7 equations, 4 figures, and 6 tables.

Figures (4)

  • Figure 1: The diagram of DEPN. When a language model leaks privacy information, DEPN calculates privacy attribution scores using the Privacy Neuron Detector. It then selects the top $z$ privacy neurons with the Privacy Neuron Aggregator and eliminates the model memorization of privacy information using the Privacy Editor.
  • Figure 2: Model performance and privacy leakage risk as the number of edited neurons changes.
  • Figure 3: The distribution of privacy neurons in the bert-base model at different training epochs.
  • Figure 4: Comparison of privacy leakage risk reduction at different training epochs.
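
The Figure 1 caption says the Privacy Neuron Aggregator selects the top $z$ privacy neurons across a batch, but this page does not say how per-sentence scores are combined. The sketch below shows one plausible scheme under stated assumptions: a neuron counts as a candidate for a sentence when its attribution score exceeds a threshold, and neurons are then ranked by how many sentences selected them. The function name, the threshold, the tie-breaking rule, and the example scores are all hypothetical.

```python
from collections import Counter

def aggregate_privacy_neurons(per_sentence_scores, threshold, z):
    """Assumed aggregation rule: count, for each neuron, the sentences in
    which its attribution score exceeds `threshold`, then keep the z
    neurons with the highest counts (ties broken by lower neuron index)."""
    counts = Counter()
    for scores in per_sentence_scores:
        for neuron_id, score in enumerate(scores):
            if score > threshold:
                counts[neuron_id] += 1
    ranked = sorted(counts, key=lambda n: (-counts[n], n))
    return ranked[:z]

# Hypothetical attribution scores: three sentences, four neurons each.
batch_scores = [
    [0.40, 0.02, 0.15, 0.00],
    [0.35, 0.01, 0.03, 0.12],
    [0.50, 0.00, 0.20, 0.01],
]

selected = aggregate_privacy_neurons(batch_scores, threshold=0.1, z=2)
print(selected)
```

Neurons are indexed by a flat integer here for brevity. In a transformer they would more naturally be identified by a (layer, position) pair, and the selected set would then be passed to the editing step that zeroes their activations.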