Training Data Extraction From Pre-trained Language Models: A Survey

Shotaro Ishihara

TL;DR

The survey addresses training data extraction from pre-trained language models (PLMs) by classifying definitions of memorization, systematizing attack and defense methods, and synthesizing empirical findings. It describes a two-phase attack framework (candidate generation followed by membership inference) and reviews defenses at the pre-processing, training, and post-processing stages, highlighting practical factors such as model scaling, deduplication, and prompt-length effects. Key contributions include a taxonomy of memorization, a synthesis of empirical results on how model size and data redundancy affect leakage, and proposed directions for evaluation schemes and integration with related research areas. The work provides a foundation for privacy-preserving practices in PLM deployment and points toward more nuanced risk assessment and multi-stage defenses.

Abstract

As the deployment of pre-trained language models (PLMs) expands, pressing security concerns have arisen regarding the potential for malicious extraction of training data, posing a threat to data privacy. This study is the first to provide a comprehensive survey of training data extraction from PLMs. Our review covers more than 100 key papers in fields such as natural language processing and security. First, preliminary knowledge is recapped and a taxonomy of various definitions of memorization is presented. The approaches for attack and defense are then systemized. Furthermore, the empirical findings of several quantitative studies are highlighted. Finally, future research directions based on this review are suggested.

Paper Structure (42 sections, 4 equations, 2 figures, 2 tables)

This paper contains 42 sections, 4 equations, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Taxonomy of definitions of memorization.
  • Figure 2: The procedure of training data extraction attacks and possible defenses.

Theorems & Definitions (3)

  • Definition 3.1: eidetic memorization
  • Definition 3.2: a variation of eidetic memorization
  • Definition 3.3: approximate memorization