Inside the Black Box: Detecting Data Leakage in Pre-trained Language Encoders
Yuan Xin, Zheng Li, Ning Yu, Dingfan Chen, Mario Fritz, Michael Backes, Yang Zhang
TL;DR
This work examines the privacy risks of pre-trained language encoders (PLEs) by studying membership leakage when only the black-box outputs of a downstream model are available. It formalizes the attack as a binary membership inference problem on a downstream model $\bm{M}(x)=g_{\bm{\beta}}(f_{\bm{\theta}'}(x))$ and trains a lightweight attacker on the responses $\bm{M}(x)$ to distinguish pre-training data from unseen data. Across four PLE architectures (BERT, ALBERT, RoBERTa, XLNet) and five downstream benchmarks spanning classification, named entity recognition (NER), and question answering, the study reports high attack efficacy, with precision and recall often in the 0.7 to 0.9 range or above, and clear separability between members and non-members in embedding space even after fine-tuning. The analysis also covers factors relevant to mitigation, namely the number of fine-tuning epochs, the fine-tuning strategy, and the dataset size. Experiments with distribution-matched non-members drawn from STORIES and generated by GPT-3.5 indicate that the leakage is not solely attributable to distribution differences. The results highlight practical privacy risks for PLE deployment and underscore the need for privacy-aware curation and auditing of pre-training data in MLaaS settings, with implications for GDPR compliance and data copyright.
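The attack setting summarized above can be illustrated with a small, purely synthetic sketch. It is not the paper's implementation: the function `fake_downstream_output` is a hypothetical stand-in for the black-box response of the downstream model, the assumption that members receive sharper posteriors is made only so that there is a signal to learn, and logistic regression is used here as one possible lightweight attacker. The numbers it prints say nothing about real encoders.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fake_downstream_output(is_member, rng, num_classes=3):
    """Hypothetical stand-in for one black-box response of the downstream model.

    ASSUMPTION: members receive sharper posteriors than non-members.
    This is synthetic and does not reproduce the paper's measurements.
    """
    scale = 4.0 if is_member else 1.0
    return softmax([rng.gauss(0.0, scale) for _ in range(num_classes)])


def make_dataset(n_per_class, rng):
    """Build (features, labels); label 1 = member, 0 = non-member."""
    X, y = [], []
    for label in (1, 0):
        for _ in range(n_per_class):
            posterior = fake_downstream_output(bool(label), rng)
            # Sort so the attacker sees the confidence shape, not class identity.
            X.append(sorted(posterior, reverse=True))
            y.append(label)
    return X, y


def sigmoid(z):
    z = max(-30.0, min(30.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_attacker(X, y, epochs=600, lr=2.0):
    """Fit a logistic-regression attacker by full-batch gradient descent."""
    dim, n = len(X[0]), len(X)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            grad_b += err
            for j in range(dim):
                grad_w[j] += err * xi[j]
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return 1 (member) if the attacker's score is positive, else 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0.0 else 0


def evaluate(w, b, X, y):
    """Return (precision, recall, accuracy) for the member class."""
    preds = [predict(w, b, x) for x in X]
    tp = sum(1 for p, t in zip(preds, y) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, y) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, y) if p == 0 and t == 1)
    correct = sum(1 for p, t in zip(preds, y) if p == t)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, correct / len(y)


rng = random.Random(0)
X_train, y_train = make_dataset(250, rng)  # attacker's labeled shadow data
X_test, y_test = make_dataset(250, rng)    # held-out responses to audit
w, b = train_attacker(X_train, y_train)
precision, recall, accuracy = evaluate(w, b, X_test, y_test)
print(f"precision={precision:.2f} recall={recall:.2f} accuracy={accuracy:.2f}")
```

The attacker only ever touches output vectors, which mirrors the black-box constraint: no encoder weights, embeddings, or gradients are used. In a real audit, the synthetic generator would be replaced by actual queries to the deployed downstream model, and the member and non-member labels would come from data whose pre-training status is known.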
Abstract
Although pre-trained language models are prevalent across Natural Language Processing (NLP), they inherently raise privacy and copyright concerns because they are trained on large-scale web-scraped data. In this paper, we pioneer a systematic exploration of such risks for pre-trained language encoders, focusing on the membership leakage of pre-training data exposed through downstream models adapted from these encoders, an aspect largely overlooked in existing literature. Our study comprises comprehensive experiments across four types of pre-trained encoder architectures, three representative downstream tasks, and five benchmark datasets. Intriguingly, our evaluations reveal, for the first time, that membership leakage exists even when only the black-box output of the downstream model is exposed, highlighting a privacy risk far greater than previously assumed. In addition, we present in-depth analysis and insights to guide future researchers and practitioners in addressing privacy considerations when developing pre-trained language models.
