Learnable Privacy Neurons Localization in Language Models

Ruizhe Chen, Tianxiang Hu, Yang Feng, Zuozhu Liu

TL;DR

This work tackles the challenge of PII memorization in large language models by introducing Learnable Privacy Neurons Localization, a method that identifies a small subset of neurons responsible for PII leakage. It uses differentiable binary masks based on a HardConcrete distribution and an adversarial training regime to suppress PII predictions while maintaining non-sensitive language modeling, aided by an $L_0$-style sparsity penalty. The study finds that the neurons memorizing PII are spread across many layers but concentrate in the MLP components, and that they are specific to PII categories; deactivating these neurons mitigates leakage with minimal degradation to general performance. Practically, this approach offers a targeted, post-hoc privacy safeguard that could complement data sanitization and differential privacy in LLM deployments.

Abstract

Concerns about the tendency of Large Language Models (LLMs) to memorize and disclose private information, particularly Personally Identifiable Information (PII), have become prominent within the community. Many efforts have been made to mitigate these privacy risks. However, the mechanism through which LLMs memorize PII remains poorly understood. To bridge this gap, we introduce a pioneering method for pinpointing PII-sensitive neurons (privacy neurons) within LLMs. Our method uses learnable binary weight masks, trained adversarially, to localize the specific neurons that account for PII memorization. Our investigations find that PII is memorized by a small subset of neurons spread across all layers, and that these neurons exhibit PII specificity. Furthermore, we validate the potential for PII risk mitigation by deactivating the localized privacy neurons. Both quantitative and qualitative experiments demonstrate the effectiveness of our neuron localization algorithm.
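
To make the idea of a learnable, differentiable binary mask more concrete, the following is a minimal sketch of HardConcrete mask sampling together with an expected-$L_0$ sparsity penalty. It is not the authors' implementation: the constants `GAMMA`, `ZETA`, and `BETA` are values commonly used with the hard concrete distribution and are assumed here, and the names `sample_mask`, `deterministic_mask`, `expected_l0`, and `log_alpha` are hypothetical. How the mask is applied to model weights and how the adversarial objective is built are not shown.

```python
import math
import random

# Stretch interval and temperature commonly used with the hard concrete
# distribution; assumed values, not taken from the paper.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def sample_mask(log_alpha, rng):
    """Draw one relaxed mask value in [0, 1] per neuron (training-time view)."""
    mask = []
    for la in log_alpha:
        u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep log() finite
        s = sigmoid((math.log(u) - math.log(1.0 - u) + la) / BETA)
        stretched = s * (ZETA - GAMMA) + GAMMA  # stretch to (GAMMA, ZETA)
        mask.append(min(1.0, max(0.0, stretched)))  # clip so exact 0/1 can occur
    return mask


def deterministic_mask(log_alpha):
    """Noise-free mask, as one might use after training (hypothetical helper)."""
    return [
        min(1.0, max(0.0, sigmoid(la) * (ZETA - GAMMA) + GAMMA))
        for la in log_alpha
    ]


def expected_l0(log_alpha):
    """Expected number of non-zero mask entries, a differentiable L0 surrogate."""
    shift = BETA * math.log(-GAMMA / ZETA)
    return sum(sigmoid(la - shift) for la in log_alpha)


if __name__ == "__main__":
    log_alpha = [-6.0, 0.0, 6.0]  # one learnable parameter per neuron
    print(sample_mask(log_alpha, random.Random(0)))
    print(deterministic_mask(log_alpha))
    print(expected_l0(log_alpha))
```

In this sketch, a strongly negative `log_alpha` drives a neuron's mask to exactly 0 and a strongly positive one drives it to exactly 1, while the penalty from `expected_l0` can be added to a training loss to control how many entries stay non-zero.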

Paper Structure (20 sections, 7 equations, 5 figures, 2 tables, 1 algorithm)

Figures (5)

  • Figure 1: An illustration of our neuron localization method.
  • Figure 2: The distribution of privacy neurons in different layers (mean and std across three datasets).
  • Figure 3: The distribution of privacy neurons in different model components.
  • Figure 4: Heatmap of the similarity of privacy neurons according to different categories.
  • Figure 5: Sensitivity to the number of privacy neurons. Experiments are conducted on the ECHR dataset.