Learning to Detect Language Model Training Data via Active Reconstruction

Junjie Oscar Yin, John X. Morris, Vitaly Shmatikov, Sewon Min, Hannaneh Hajishirzi

TL;DR

This work introduces the Active Data Reconstruction Attack (ADRA), a family of membership inference attacks (MIAs) that actively induces a model to reconstruct a given text through training, motivated by findings that reinforcement learning sharpens behaviors already encoded in weights.

Abstract

Detecting LLM training data is generally framed as a membership inference attack (MIA) problem. However, conventional MIAs operate passively on fixed model weights, using log-likelihoods or text generations. In this work, we introduce the Active Data Reconstruction Attack (ADRA), a family of MIAs that actively induces a model to reconstruct a given text through training. We hypothesize that training data are more reconstructible than non-members, and that the difference in their reconstructibility can be exploited for membership inference. Motivated by findings that reinforcement learning (RL) sharpens behaviors already encoded in weights, we leverage on-policy RL to actively elicit data reconstruction by finetuning a policy initialized from the target model. To effectively use RL for MIA, we design reconstruction metrics and contrastive rewards. The resulting algorithms, ADRA and its adaptive variant ADRA+, improve both reconstruction and detection given a pool of candidate data. Experiments show that our methods consistently outperform existing MIAs in detecting pre-training, post-training, and distillation data, with an average improvement of 10.7% over the previous runner-up. In particular, ADRA+ improves over Min-K%++ by 18.8% on BookMIA for pre-training detection and by 7.6% on AIME for post-training detection.
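
The abstract's core idea, that members should be more reconstructible than non-members and that this gap can be turned into a detection signal, can be illustrated with a minimal sketch. The abstract does not specify the paper's reconstruction metrics or contrastive rewards, so everything below is an assumption for illustration only: `reconstruction_score` is a hypothetical stand-in based on token-level sequence similarity, and `auroc` is a plain pairwise AUROC used to summarize how well such scores separate members from non-members.

```python
from difflib import SequenceMatcher


def reconstruction_score(generated: str, reference: str) -> float:
    """Hypothetical stand-in metric (not the paper's): token-level
    similarity between a generated continuation and the true text,
    in the range [0, 1]."""
    gen_tokens = generated.split()
    ref_tokens = reference.split()
    matcher = SequenceMatcher(None, gen_tokens, ref_tokens, autojunk=False)
    return matcher.ratio()


def auroc(member_scores, nonmember_scores):
    """Pairwise AUROC: the probability that a randomly chosen member
    scores higher than a randomly chosen non-member (ties count 0.5)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


if __name__ == "__main__":
    # Toy data, invented for illustration: (generated, reference) pairs.
    members = [
        ("the threshold curve is a function of frequency",
         "the threshold curve is a smooth function of frequency"),
    ]
    nonmembers = [
        ("the threshold curve is a function of frequency",
         "stock prices fell sharply on monday morning"),
    ]
    member_scores = [reconstruction_score(g, r) for g, r in members]
    nonmember_scores = [reconstruction_score(g, r) for g, r in nonmembers]
    print("AUROC:", auroc(member_scores, nonmember_scores))
```

In this sketch the scores are computed once on fixed generations. The method described in the abstract instead finetunes a policy with on-policy RL so that the reconstructibility gap between members and non-members widens over training. That training loop is not shown here.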

Paper Structure (54 sections, 9 equations, 8 figures, 18 tables, 1 algorithm)

Figures (8)

  • Figure 1: Active Data Reconstruction Attack. A language model generates reconstructions from a candidate prefix and is rewarded via a contrastive objective. Members become easier to reconstruct than non-members over RL training, improving MIA performance.
  • Figure 2: Performance comparison between RL and SFT. As RL training continues, AUROC improves, whereas it decreases under SFT. ADRA and ADRA+ meaningfully improve over naive RL.
  • Figure 3: Prompts used to paraphrase datasets. The {input} placeholder indicates where the input text is inserted.
  • Figure 4: Qualitative example from Olmo3 Mix Arxiv: Reconstruction of a GP-based audiometry paper. Highlighted phrases show semantic and lexical overlap in core technical concepts: Gaussian process (GP), threshold curve, function of frequency, uncertainty estimates/bands, stopping criterion/criteria, smoothness, and prior knowledge/information. This overlap suggests that the reconstruction recovers the paper's key methodological vocabulary and conceptual structure.
  • Figure 5: Qualitative Example 1 from AIME. The reconstruction captures core mathematical reasoning patterns and expressions from the ground truth, despite yielding a wrong final answer.
  • ...and 3 more figures