Inside the Black Box: Detecting Data Leakage in Pre-trained Language Encoders

Yuan Xin; Zheng Li; Ning Yu; Dingfan Chen; Mario Fritz; Michael Backes; Yang Zhang

TL;DR

This work examines the privacy risks of pre-trained language encoders (PLEs) by studying membership leakage when only the black-box output of a downstream model is available. It formalizes the attack as a binary membership inference problem against a downstream model ${\bm{M}}(x)=g_{\bm{\phi}}(f_{{\bm{\theta}}'}(x))$, where $f_{{\bm{\theta}}'}$ is the fine-tuned encoder and $g_{\bm{\phi}}$ is the downstream component stacked on top of it, and trains a lightweight attack model on the responses ${\bm{M}}(x)$ to distinguish pre-training data from unseen data. Across four PLE architectures (BERT, ALBERT, RoBERTa, XLNet) and five downstream benchmarks spanning text classification, NER, and Q&A, the study reports high attack efficacy, with precision and recall often between 0.7 and 0.9 or higher, and clear separability of members and non-members in embedding space even after fine-tuning. The analysis also covers factors relevant to mitigation, namely the number of fine-tuning epochs, the fine-tuning strategy, and the dataset size. Experiments with distribution-matched non-members drawn from STORIES and generated by GPT-3.5 indicate that the leakage is not solely attributable to distribution differences between members and non-members. The results point to practical privacy risks in deploying PLEs and to the need for privacy-aware curation and auditing of pre-training data in MLaaS settings, with implications for GDPR compliance and data copyright.
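The black-box setting described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the function `query_downstream_model` is a hypothetical stand-in for one query to ${\bm{M}}(x)$, the posteriors are synthetic, and the assumption that members receive sharper posteriors than non-members is made purely for illustration. The attacker here is a plain logistic regression written with the standard library only; the paper's attack model may differ.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def query_downstream_model(is_member, rng, num_classes=4):
    """Hypothetical stand-in for one black-box query M(x).

    ASSUMPTION: members receive sharper posteriors than non-members.
    The data is synthetic and chosen for illustration, not measured.
    """
    scale = 4.0 if is_member else 1.0
    return softmax([scale * rng.gauss(0.0, 1.0) for _ in range(num_classes)])


def attack_features(posterior):
    """Sort the posterior so the attacker ignores which class was predicted."""
    return sorted(posterior, reverse=True)


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def train_attacker(xs, ys, epochs=500, lr=1.0):
    """Fit a logistic-regression attacker with full-batch gradient descent."""
    dim, n = len(xs[0]), len(xs)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict_member(w, b, x):
    """Return True when the attacker labels the sample as pre-training data."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5


def make_split(n_per_class, rng):
    """Build a balanced set of (features, label) pairs; label 1 means member."""
    xs, ys = [], []
    for label in (1, 0):
        for _ in range(n_per_class):
            xs.append(attack_features(query_downstream_model(label == 1, rng)))
            ys.append(label)
    return xs, ys


def run_attack_demo(seed=0):
    """Train on one synthetic split, then report metrics on a fresh split."""
    rng = random.Random(seed)
    train_x, train_y = make_split(200, rng)
    test_x, test_y = make_split(200, rng)
    w, b = train_attacker(train_x, train_y)
    preds = [predict_member(w, b, x) for x in test_x]
    tp = sum(1 for p, y in zip(preds, test_y) if p and y == 1)
    fp = sum(1 for p, y in zip(preds, test_y) if p and y == 0)
    fn = sum(1 for p, y in zip(preds, test_y) if not p and y == 1)
    tn = sum(1 for p, y in zip(preds, test_y) if not p and y == 0)
    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "accuracy": (tp + tn) / len(test_y),
    }


if __name__ == "__main__":
    print(run_attack_demo())
```

Sorting the posterior before feeding it to the attacker is a design choice of this sketch: it makes the features independent of the predicted class, so the attacker can only exploit the shape of the output distribution. Any numbers it prints reflect the synthetic gap built into `query_downstream_model`, not the results reported in the paper.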

Abstract

Despite their prevalence in Natural Language Processing (NLP), pre-trained language models inherently carry privacy and copyright concerns because they are trained on large-scale web-scraped data. In this paper, we pioneer a systematic exploration of such risks associated with pre-trained language encoders, focusing specifically on the membership leakage of pre-training data exposed through downstream models adapted from pre-trained language encoders, an aspect largely overlooked in existing literature. Our study encompasses comprehensive experiments across four types of pre-trained encoder architectures, three representative downstream tasks, and five benchmark datasets. Intriguingly, our evaluations reveal, for the first time, the existence of membership leakage even when only the black-box output of the downstream model is exposed, highlighting a privacy risk far greater than previously assumed. We also present in-depth analysis and insights to guide future researchers and practitioners in addressing privacy considerations when developing pre-trained language models.

Paper Structure (40 sections, 8 figures, 6 tables)

Related work
Pre-trained Language Encoders (PLEs).
Downstream Tasks.
Data Leakage of Pre-training Models.
Formulation
Target Models
Membership Inference Attacks
Attack Method
Threat Model
Attacker's Background Knowledge
Intuition
Attack Pipeline
Inference.
Evaluation.
Experiments
...and 25 more sections

Figures (8)

Figure 1: Overview of the workflow.
Figure 2: t-SNE visualization of BERT embeddings. The pre-training and unseen samples are plotted as red and blue dots, respectively. (a) Embeddings directly obtained from PLEs, i.e., $f_{\bm{\theta}}(x)$. (b) Embeddings obtained from the encoder after fine-tuning, which corresponds to $f_{{\bm{\theta}}'}(x)$ with ${{\bm{\theta}}'}\neq {\bm{\theta}}$. (c) Embeddings obtained from the downstream model, i.e., $g_{\bm{\phi}}(f_{{\bm{\theta}}'}(x))$. Fine-tuning is conducted on the AG's News dataset.
Figure 3: Attack performance for different PLE architectures (BERT, ALBERT, RoBERTa, XLNet) on text classification, NER and Q&A tasks.
Figure 4: Attack performance with relaxation of pre-training datasets (Relaxation-i) and relaxation of non-member datasets (Relaxation-ii) on the NER downstream task: Wiki (Wikipedia), Books (BooksCorpus), Mixture (Wikipedia+BooksCorpus). X-axis: attack training dataset. Y-axis: attack testing dataset.
Figure 5: Attack performance when varying the size of the ${\mathcal{S}}_{\mathrm{pre}}$ used for training the attack model, with AG's News being the fine-tuning dataset.
...and 3 more figures
