Amplifying Training Data Exposure through Fine-Tuning with Pseudo-Labeled Memberships

Myung Gyo Oh, Hong Eun Ahn, Leo Hyun Park, Taekyoung Kwon

TL;DR

This work reveals a novel privacy risk for language models: an adversary can fine-tune a pre-trained LM to amplify the exposure of its pre-training data. The attack pseudo-labels the LM's own generations with membership proxies derived from perturbation discrepancy, then applies RLHF-style fine-tuning on those generations. This substantially increases memorization: LMs with over 1B parameters show 4–8× higher training-data exposure, and the leaked verbatim sequences become longer. The study provides empirical evidence across OPT variants, demonstrates robustness through ablations on redundancy, uniqueness, and diversity, and discusses mitigations such as reducing the reliability of machine-generated probabilities or using relaxed fine-tuning objectives. These results highlight a critical privacy vulnerability in large LMs and motivate future work on defense strategies and on applicability to other generative systems.

Abstract

Neural language models (LMs) are vulnerable to training data extraction attacks due to data memorization. This paper introduces a novel attack scenario wherein an attacker adversarially fine-tunes pre-trained LMs to amplify the exposure of the original training data. This strategy differs from prior studies by aiming to intensify the LM's retention of its pre-training dataset. To achieve this, the attacker needs to collect generated texts that are closely aligned with the pre-training data. However, without knowledge of the actual dataset, quantifying the amount of pre-training data within generated texts is challenging. To address this, we propose the use of pseudo-labels for these generated texts, leveraging membership approximations indicated by machine-generated probabilities from the target LM. We subsequently fine-tune the LM to favor generations with higher likelihoods of originating from the pre-training data, based on their membership probabilities. Our empirical findings indicate a remarkable outcome: LMs with over 1B parameters exhibit a four- to eight-fold increase in training data exposure. We discuss potential mitigations and suggest future research directions.
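
The pseudo-labeling step can be illustrated with a minimal sketch. It assumes each generated text already carries a perturbation-discrepancy score, and that a lower score marks the text more likely to resemble pre-training data. The names `PreferencePair` and `build_preference_pairs`, the random pairing, and the handling of ties and leftover items are all assumptions made for illustration, not the authors' implementation.

```python
import random
from typing import List, NamedTuple, Tuple


class PreferencePair(NamedTuple):
    chosen: str    # text with the lower perturbation discrepancy
    rejected: str  # text with the higher perturbation discrepancy


def build_preference_pairs(
    scored: List[Tuple[str, float]], seed: int = 0
) -> List[PreferencePair]:
    """Pair generations two at a time and pseudo-label each pair.

    `scored` holds (text, perturbation_discrepancy) tuples. Random pairing
    is an assumption of this sketch.
    """
    rng = random.Random(seed)
    items = list(scored)
    rng.shuffle(items)

    pairs = []
    # Walk the shuffled list two items at a time; an odd leftover is dropped.
    for (text_a, d_a), (text_b, d_b) in zip(items[0::2], items[1::2]):
        if d_a == d_b:
            continue  # a tie carries no preference signal, so skip it
        if d_a < d_b:
            pairs.append(PreferencePair(chosen=text_a, rejected=text_b))
        else:
            pairs.append(PreferencePair(chosen=text_b, rejected=text_a))
    return pairs


if __name__ == "__main__":
    # Hypothetical scores, made up purely to exercise the function.
    demo = [("a", 0.1), ("b", 0.9), ("c", -0.3), ("d", 0.5)]
    for pair in build_preference_pairs(demo):
        print(pair)
```

Pairs in this form match the chosen/rejected comparison data that RLHF-style reward-model training typically consumes, which is why the sketch emits pairs rather than absolute labels.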

Paper Structure (28 sections, 2 equations, 6 figures, 15 tables)

This paper contains 28 sections, 2 equations, 6 figures, 15 tables.

Figures (6)

  • Figure 1: An overview. We feed an empty prompt into the target LM to generate a large volume of text. For each generated text, we calculate the perturbation discrepancy (Mitchell et al., 2023), where lower values signify a higher probability that the text is human-written and potentially contains sensitive training data. We then group the generations into pairs and fine-tune the target LM (Ouyang et al., 2022) to favor the text with the lower perturbation discrepancy in each pair.
  • Figure 2: True positives by model scale for the reference LM (blue $\circ$) and the fine-tuned LM (red $\times$). Dotted lines show the linear approximation for each model.
  • Figure 3: True positives per GB for OPT training datasets for reference LM (blue) and fine-tuned LM (red). A value of $0$ indicates that no training data was extracted from that dataset.
  • Figure 4: Distribution of verbatim text lengths extracted from the reference LM (blue) and the fine-tuned LM (red). The maximum lengths of training data extracted from the reference LM and the fine-tuned LM are $885$ and $1163$, respectively.
  • Figure 5: Training logs of the reward model (RM) for the dataset generated by each OPT model.
  • ...and 1 more figure
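
For reference, the perturbation discrepancy named in the Figure 1 caption is defined in the DetectGPT paper (Mitchell et al., 2023) as shown below. Here $p_\theta$ is the target LM, $q(\cdot \mid x)$ is a perturbation function such as a mask-filling model, and $\tilde{x}$ is a perturbed copy of the text $x$. The notation follows DetectGPT and may differ from the symbols used in this paper's own equations.

$$
d(x, p_\theta, q) = \log p_\theta(x) - \mathbb{E}_{\tilde{x} \sim q(\cdot \mid x)} \left[ \log p_\theta(\tilde{x}) \right]
$$

In practice the expectation is estimated by averaging over a finite number of sampled perturbations. A large positive $d$ means the text sits near a local maximum of the LM's log-probability, which DetectGPT treats as a sign of machine-generated text. This is consistent with the caption's reading that lower values indicate human-written text.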