Forget to Flourish: Leveraging Machine-Unlearning on Pretrained Language Models for Privacy Leakage

Md Rafi Ur Rashid, Jing Liu, Toshiaki Koike-Akino, Shagufta Mehnaz, Ye Wang

TL;DR

The paper addresses the privacy risks that arise when pre-trained language models are distributed publicly and then fine-tuned on private data. It introduces an unlearning-based poisoning technique that manipulates a pre-trained model to amplify privacy leakage during downstream fine-tuning while preserving model utility. The authors formalize a threat model with membership inference and data extraction games, and develop bounded unlearning and noise-based data perturbations to maximize leakage. Empirical results for Llama2-7B and GPT-Neo on MIND and Wiki+PII show substantial gains in attack efficacy over baselines with minimal impact on validation perplexity; differential privacy defenses offer only limited protection unless they are strong enough to cripple utility. The work underscores the privacy hazards of unverified pre-trained models and motivates more robust defenses against poisoning via unlearning.

Abstract

Fine-tuning large language models on private data for downstream applications poses significant privacy risks in potentially exposing sensitive information. Several popular community platforms now offer convenient distribution of a large variety of pre-trained models, allowing anyone to publish without rigorous verification. This scenario creates a privacy threat, as pre-trained models can be intentionally crafted to compromise the privacy of fine-tuning datasets. In this study, we introduce a novel poisoning technique that uses model-unlearning as an attack tool. This approach manipulates a pre-trained language model to increase the leakage of private data during the fine-tuning process. Our method enhances both membership inference and data extraction attacks while preserving model utility. Experimental results across different models, datasets, and fine-tuning setups demonstrate that our attacks significantly surpass baseline performance. This work serves as a cautionary note for users who download pre-trained models from unverified sources, highlighting the potential risks involved.

Paper Structure

This paper contains 36 sections, 5 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Overview of the threat model and steps of the attack: (1) The attacker downloads a pre-trained LLM, (2) poisons the model with an algorithm, $\mathcal{T}_\text{adv}$, and (3) releases the model. (4) The victim downloads the poisoned LLM, (5) fine-tunes it on their private data, and (6) releases API-based query access to the model. (7) Finally, the adversary conducts membership inference or data extraction.
  • Figure 2: Histograms of loss values on pre-trained model $\theta_\text{pre}$, fine-tuned model $\theta_\text{ft}$, and fine-tuned poisoned model $\theta_\text{ft}^\text{adv}$.
  • Figure 3: Membership inference AUC and validation perplexity for random Poison-char-Rel and Poison-word-Rel attacks with varying noise levels.
  • Figure 4: Histograms of loss values on pre-trained model $\theta_\text{pre}$, fine-tuned model $\theta_\text{ft}$, and fine-tuned poisoned model $\theta_\text{ft}^\text{adv}$, where the poisoning follows wen2024privacy.
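
The figures above report loss histograms and membership inference AUC. As a rough illustration of how those two quantities relate, the sketch below scores each sample by its loss and computes the AUC of a simple "lower loss means member" rule. This is a minimal sketch under stated assumptions, not the authors' attack: the loss-threshold scoring rule, the function name `membership_auc`, and the synthetic Gaussian losses are all hypothetical and chosen only for illustration.

```python
import random


def membership_auc(member_losses, nonmember_losses):
    """AUC of a loss-based membership score (hypothetical helper).

    Returns the probability that a randomly chosen member has a lower
    loss than a randomly chosen non-member, counting ties as one half.
    """
    wins = 0.0
    for m in member_losses:
        for n in nonmember_losses:
            if m < n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_losses) * len(nonmember_losses))


# Synthetic per-sample losses (assumption: NOT data from the paper).
rng = random.Random(0)
nonmembers = [rng.gauss(3.0, 0.5) for _ in range(500)]

# Small gap between member and non-member losses: a weak attack signal.
members_small_gap = [rng.gauss(2.8, 0.5) for _ in range(500)]

# Larger gap: the two loss histograms overlap less, so the AUC rises.
members_large_gap = [rng.gauss(2.0, 0.5) for _ in range(500)]

auc_small_gap = membership_auc(members_small_gap, nonmembers)
auc_large_gap = membership_auc(members_large_gap, nonmembers)

print(f"AUC with small loss gap: {auc_small_gap:.3f}")
print(f"AUC with large loss gap: {auc_large_gap:.3f}")
```

Under this toy rule, the less the member and non-member loss histograms overlap, the closer the AUC gets to 1; an AUC of 0.5 corresponds to guessing at random.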