Amplifying Training Data Exposure through Fine-Tuning with Pseudo-Labeled Memberships
Myung Gyo Oh, Hong Eun Ahn, Leo Hyun Park, Taekyoung Kwon
TL;DR
This work reveals a novel privacy risk for language models: an adversary can fine-tune a pre-trained LM to amplify the exposure of its pre-training data. The attack pseudo-labels the LM's self-generated texts with membership proxies derived from perturbation-discrepancy signals, then applies RLHF-style fine-tuning on those generations. This substantially increases memorization: LMs with over 1B parameters show 4–8× higher training-data exposure and leak longer sequences. The study provides empirical evidence across OPT variants, demonstrates robustness through ablations on redundancy, uniqueness, and diversity, and discusses mitigations such as reducing the reliability of machine-generated probabilities or using relaxed fine-tuning objectives. These results highlight a critical privacy vulnerability in large LMs and motivate future work on defense strategies and on applicability to other generative systems.
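To make the perturbation-discrepancy signal concrete, the following minimal sketch scores a text by how far its log-probability sits above the average log-probability of lightly perturbed copies. It is an illustration under stated assumptions, not the authors' implementation: `toy_log_prob`, `toy_perturb`, and the `MEMORIZED` set are hypothetical stand-ins for the target LM and a real perturbation model.

```python
import random
from statistics import mean
from typing import Callable


def perturbation_discrepancy(
    text: str,
    log_prob: Callable[[str], float],
    perturb: Callable[[str, random.Random], str],
    n_perturbations: int = 10,
    seed: int = 0,
) -> float:
    """Return log p(text) minus the mean log p of perturbed copies of text."""
    rng = random.Random(seed)
    perturbed_scores = [log_prob(perturb(text, rng)) for _ in range(n_perturbations)]
    return log_prob(text) - mean(perturbed_scores)


# --- Hypothetical stand-ins so the sketch runs without a real LM ---
MEMORIZED = {"the quick brown fox jumps over the lazy dog"}


def toy_log_prob(text: str) -> float:
    # Toy scorer: a "memorized" string gets a much higher log-probability.
    return -5.0 if text in MEMORIZED else -20.0 - 0.1 * len(text.split())


def toy_perturb(text: str, rng: random.Random) -> str:
    # Toy perturbation: replace one randomly chosen word with a placeholder.
    words = text.split()
    words[rng.randrange(len(words))] = "<mask>"
    return " ".join(words)


member_like = perturbation_discrepancy(
    "the quick brown fox jumps over the lazy dog", toy_log_prob, toy_perturb
)
other = perturbation_discrepancy(
    "a sentence the toy scorer has never seen", toy_log_prob, toy_perturb
)
```

In this toy setup the memorized string receives a large positive discrepancy while the unseen string scores near zero, which is the kind of gap a membership proxy relies on.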
Abstract
Neural language models (LMs) are vulnerable to training data extraction attacks due to data memorization. This paper introduces a novel attack scenario wherein an attacker adversarially fine-tunes pre-trained LMs to amplify the exposure of the original training data. This strategy differs from prior studies by aiming to intensify the LM's retention of its pre-training dataset. To achieve this, the attacker needs to collect generated texts that are closely aligned with the pre-training data. However, without knowledge of the actual dataset, quantifying the amount of pre-training data within generated texts is challenging. To address this, we propose pseudo-labeling these generated texts, using membership approximations indicated by machine-generated probabilities from the target LM. We then fine-tune the LM to favor generations that, according to their membership probabilities, are more likely to originate from the pre-training data. Empirically, LMs with over 1B parameters exhibit a four- to eight-fold increase in training data exposure. We discuss potential mitigations and suggest future research directions.
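One plausible way to turn pseudo-labeled membership scores into data for preference-based fine-tuning is to pair generations and mark the higher-scoring one as preferred. The sketch below assumes each generation already carries a membership score. The `PreferencePair` type, the `build_preference_pairs` helper, and the `min_gap` threshold are hypothetical names introduced for illustration and are not taken from the paper.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class PreferencePair:
    chosen: str    # pseudo-labeled as more member-like
    rejected: str  # pseudo-labeled as less member-like


def build_preference_pairs(scored, min_gap=0.0):
    """Pair up (text, membership_score) items; the higher score is 'chosen'.

    Pairs whose scores differ by no more than min_gap (an assumed,
    illustrative threshold) are skipped as too ambiguous to label.
    """
    pairs = []
    for (text_a, score_a), (text_b, score_b) in combinations(scored, 2):
        if abs(score_a - score_b) <= min_gap:
            continue
        if score_a > score_b:
            pairs.append(PreferencePair(text_a, text_b))
        else:
            pairs.append(PreferencePair(text_b, text_a))
    return pairs


# Illustrative scores only; real scores would come from the target LM.
scored_generations = [("gen A", 2.1), ("gen B", 0.3), ("gen C", 0.4)]
pairs = build_preference_pairs(scored_generations, min_gap=0.5)
```

Here "gen A" is preferred over both other generations, while the "gen B" and "gen C" pair is dropped because their scores are too close. The resulting pairs could then feed a reward model or another preference-based objective.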
