PreCurious: How Innocent Pre-Trained Language Models Turn into Privacy Traps

Ruixuan Liu, Tianhao Wang, Yang Cao, Li Xiong

TL;DR

The PreCurious framework reveals a new attack surface in which the attacker releases a pre-trained model and gains black-box access to the final fine-tuned model. It shows that the protection attributed to parameter-efficient and differentially private fine-tuning can be broken in a stealthy manner, compared to fine-tuning on a benign pre-trained model.

Abstract

The pre-training and fine-tuning paradigm has demonstrated its effectiveness and has become the standard approach for tailoring language models to various tasks. Currently, community-based platforms offer easy access to various pre-trained models, as anyone can publish without strict validation processes. However, a released pre-trained model can be a privacy trap for fine-tuning datasets if it is carefully designed. In this work, we propose the PreCurious framework to reveal a new attack surface in which the attacker releases the pre-trained model and gains black-box access to the final fine-tuned model. PreCurious aims to escalate the general privacy risk of both membership inference and data extraction on the fine-tuning dataset. The key intuition behind PreCurious is to manipulate the memorization stage of the pre-trained model and guide fine-tuning with a seemingly legitimate configuration. While empirical and theoretical evidence suggests that parameter-efficient and differentially private fine-tuning techniques can defend against privacy attacks on a fine-tuned model, PreCurious demonstrates that this invulnerability can be broken in a stealthy manner, compared to fine-tuning on a benign pre-trained model. Although DP provides some mitigation against membership inference attacks, PreCurious, by further leveraging a sanitized dataset, demonstrates potential vulnerabilities to targeted data extraction even under differentially private tuning with a strict privacy budget, e.g., $\epsilon=0.05$. Thus, PreCurious warns users about the potential risks of downloading pre-trained models from unknown sources, relying solely on tutorials or common-sense defenses, and releasing sanitized datasets even after perfect scrubbing.
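
To give a sense of why $\epsilon=0.05$ counts as a strict budget, the worked bound below uses the standard hypothesis-testing reading of $(\epsilon,\delta)$-differential privacy. It is general DP background rather than a result stated in this abstract. The symbol $\delta$ and the operating point $\text{FPR}=0.1$ are illustrative assumptions.

$$
\text{TPR} \le e^{\epsilon}\,\text{FPR} + \delta, \qquad e^{0.05} \approx 1.0513 \;\Rightarrow\; \text{TPR} \le 0.1051 + \delta \quad \text{at } \text{FPR} = 0.1 .
$$

Under this reading, a membership test against a single fine-tuning example can be only marginally better than random guessing at such a budget. This is consistent with the abstract's remark that DP mitigates membership inference, while the targeted data extraction result additionally relies on a released sanitized dataset.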

Paper Structure (39 sections, 5 equations, 13 figures, 9 tables)

Figures (13)

  • Figure 1: The privacy vulnerability for target models fine-tuned by various methods ranks as Head-FT $>$ Full $>$ Adapter-FT. PreCurious increases the privacy risk for each iteration and ruins the privacy-utility trade-off, as demonstrated with Head-FT and Adapter-FT. $E_\text{ft}$ indicates the fine-tuning epochs and lower validation perplexity means better performance.
  • Figure 2: Framework overview of PreCurious. The dashed gray line indicates extra side information that can be utilized: 1) the stopping criterion, 2) the fine-tuning method, and 3) the released sanitized data by masking the secret. We design Accelerated and Lagging strategies for stopping by epoch or by performance. We propose an aggressive anti-freezing strategy when the victim uses the given fine-tuning method. We utilize a released sanitized dataset in targeted data extraction experiments.
  • Figure 3: Privacy risk for different model initialization status. Each point indicates the fine-tuned checkpoint for the Enron dataset with Adapter-FT. We use TPR@0.1FPR as the proxy metric to measure the privacy risk of the model based on the scoring method in Eq. \ref{eq:mia_loss}. We fully fine-tuned the benign GPT-2 model on the auxiliary dataset for $E_\text{pre}=1$ and $E_\text{pre}=5$ separately for Lagging Init and Accelerated Init with learning rate $\eta_\text{pre}=10^{-5}$ as model initialization.
  • Figure 4: Ablation study of PreCurious on the crafted initialization and reference model with Enron and Adapter-FT GPT-2. Loss distributions for Benign initialization w/o $\theta_\text{ref}$, benign initialization w/ Full-Ref, and PreCurious initialization w/ Full-Ref.
  • Figure 5: ROC-AUC curve for Enron on Adapter-FT GPT-2. Base-Full indicates calibrating with a benign model cannot even beat Loss-Att with the same benign initialization.
  • ...and 8 more figures
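
The captions above measure privacy risk as a true-positive rate at a fixed false-positive rate, computed from loss-based membership scores that are optionally calibrated with a reference model. The sketch below shows one common way to compute such a metric. The scoring equation the captions refer to is not reproduced in this listing, so the negative-loss score, the `ref_loss - target_loss` calibration, the function names, and the toy loss values are all assumptions made for illustration, not the authors' implementation.

```python
from typing import List, Optional, Sequence


def membership_scores(
    target_losses: Sequence[float],
    ref_losses: Optional[Sequence[float]] = None,
) -> List[float]:
    """Return one score per sample; a higher score means 'more likely a member'.

    Assumption: without a reference model the score is the negative loss under
    the fine-tuned model; with one, it is the reference loss minus that loss.
    """
    if ref_losses is None:
        return [-t for t in target_losses]
    return [r - t for t, r in zip(target_losses, ref_losses)]


def tpr_at_fpr(
    member_scores: Sequence[float],
    nonmember_scores: Sequence[float],
    target_fpr: float,
) -> float:
    """TPR at a threshold whose FPR on the non-members is at most target_fpr."""
    ranked = sorted(nonmember_scores, reverse=True)
    allowed = int(target_fpr * len(ranked))  # false positives we may tolerate
    # A sample is flagged as a member when its score is strictly above the
    # threshold, so at most `allowed` non-members can be flagged.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    flagged = sum(score > threshold for score in member_scores)
    return flagged / len(member_scores)


# Toy per-sample losses under a hypothetical fine-tuned model.
member_losses = [0.5, 0.8, 1.0, 2.5]
nonmember_losses = [2.0, 2.2, 2.4, 2.6, 2.8, 3.0, 3.2, 3.4, 3.6, 0.9]

risk = tpr_at_fpr(
    membership_scores(member_losses),
    membership_scores(nonmember_losses),
    target_fpr=0.1,
)
print(risk)  # 0.75 for the toy values above
```

Fixing the false-positive rate and reporting only the true-positive rate focuses the comparison on the low-error regime, where confidently identified members matter most. Swapping in calibrated scores only changes the input to `tpr_at_fpr`, not the metric itself.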

Theorems & Definitions (2)

  • Definition 2.1: Successful privacy risk amplification
  • Definition 2.2: Privacy risk amplification stealthiness