Teach LLMs to Phish: Stealing Private Information from Language Models

Ashwinee Panda, Christopher A. Choquette-Choo, Zhengming Zhang, Yaoqing Yang, Prateek Mittal

TL;DR

This work proposes a new practical data extraction attack that enables an adversary to target and extract sensitive or personally identifiable information (PII) from a model trained on user data, with attack success rates upwards of 10% and, at times, as high as 50%.

Abstract

When large language models are trained on private data, it can be a significant privacy risk for them to memorize and regurgitate sensitive information. In this work, we propose a new practical data extraction attack that we call "neural phishing". This attack enables an adversary to target and extract sensitive or personally identifiable information (PII), e.g., credit card numbers, from a model trained on user data, with attack success rates upwards of 10% and, at times, as high as 50%. Our attack assumes only that an adversary can insert as few as tens of benign-appearing sentences into the training dataset, using only vague priors on the structure of the user data.

Paper Structure (31 sections, 21 figures)

Figures (21)

  • Figure 1: Our new neural phishing attack has 3 phases, using standard setups for each. Phase I (Pretraining): A few adversarial poisons are injected into the pretraining dataset, and the model trains on both the clean data and the poisons, which are randomly included, for as long as 100000 steps until fine-tuning starts. Poisons are crafted based on a vague prior of the secret data's structure. For example, if the attacker believes the secret may resemble a user biography, they can craft poison biographies of public figures such as Alexander Hamilton. Phase II (Fine-tuning): The secret is included, even just once, in the fine-tuning dataset; the model memorizes this secret during standard fine-tuning because it has been "taught to phish". Phase III (Inference): The attacker aims to extract the secret contained in the fine-tuning data. They prompt the model with information similar to the data that precedes the secret. The model then generates the secret itself and the attack succeeds. Secret Extraction Rate: Depending on how much prior information the adversary has and how often the secrets were seen, adversaries can obtain between 10% and 80% success in extracting 12-digit secrets. Attacks never succeed without poisoning.
  • Figure 2: Random poisoning can extract secrets. The poisons are random sentences. $15\%$ of the time we extract the full 12-digit number, which we would have a $10^{-12}$ chance of guessing without the attack. Appending 'not' to the poison prevents the model from overfitting.
  • Figure 3: Duplicated secrets are much easier to extract. Longer secrets are harder to extract. We use $100$ poisons.
  • Figure 4: Larger models memorize more. The number of poisons is $50$. The x-axis is in billions of parameters.
  • Figure 5: Pretraining for longer on more data increases SER. (a): Given enough poisons, the model that finished pretraining (orange) memorizes the secret better than the model that is only $\approx 1/3$ through pretraining (blue) because it knows the clean data better. (b): The model that fine-tunes on the clean data for longer (orange) similarly has higher SER.
  • ...and 16 more figures
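
The captions above report a secret extraction rate (SER) for 12-digit secrets and compare it with a $10^{-12}$ chance of guessing. The following is a minimal sketch of how such numbers could be computed, not the authors' evaluation code. It assumes that SER is the fraction of trials in which the model's generation begins with the exact secret; the function names and the toy outputs are hypothetical.

```python
from fractions import Fraction


def random_guess_probability(num_digits: int) -> Fraction:
    """Chance of guessing a uniformly random decimal secret in a single try."""
    return Fraction(1, 10 ** num_digits)


def secret_extraction_rate(generations: list[str], secret: str) -> float:
    """Fraction of trials whose generation starts with the exact secret.

    Assumption: exact-prefix match is used as the success criterion.
    """
    if not generations:
        raise ValueError("need at least one trial")
    hits = sum(1 for text in generations if text.strip().startswith(secret))
    return hits / len(generations)


# Toy data (hypothetical): four trials against one 12-digit secret.
secret = "123456789012"
outputs = [
    "123456789012",
    "123456789013",           # one digit off, counts as a failure
    "000000000000",
    "123456789012 is mine",   # secret followed by extra text, counts as a hit
]

ser = secret_extraction_rate(outputs, secret)
baseline = random_guess_probability(len(secret))
print(f"SER = {ser:.2f}, random-guess baseline = {float(baseline):.0e}")
```

Under this criterion a near miss counts as a failure, which is why even a modest SER is far above the random-guess baseline for a 12-digit secret.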

Theorems & Definitions (1)

  • Definition 2.1: Extractable Secret