Learning the Signature of Memorization in Autoregressive Language Models

David Ilić, Kostadin Cvejoski, David Stanojević, Evgeny Grigorenko

Abstract

All prior membership inference attacks for fine-tuned language models use hand-crafted heuristics (e.g., loss thresholding, Min-K\%, reference calibration), each bounded by the designer's intuition. We introduce the first transferable learned attack, enabled by the observation that fine-tuning any model on any corpus yields unlimited labeled data, since membership is known by construction. This removes the shadow-model bottleneck and brings membership inference into the deep learning era: learning what matters rather than designing it, with generalization through training diversity and scale. We discover that fine-tuning language models produces an invariant signature of memorization detectable across architectural families and data domains. We train a membership inference classifier exclusively on transformer-based models. It transfers zero-shot to Mamba (state-space), RWKV-4 (linear attention), and RecurrentGemma (gated recurrence), achieving 0.963, 0.972, and 0.936 AUC, respectively. Each evaluation combines an architecture and dataset never seen during training, yet all three exceed performance on held-out transformers (0.908 AUC). These four families share no computational mechanisms; their only commonality is gradient descent on cross-entropy loss. Even simple likelihood-based methods exhibit strong transfer, confirming that the signature exists independently of the detection method. Our method, Learned Transfer MIA (LT-MIA), captures this signal most effectively by reframing membership inference as sequence classification over per-token distributional statistics. On transformers, LT-MIA achieves 2.8$\times$ higher TPR at 0.1\% FPR than the strongest baseline. The method also transfers to code (0.865 AUC) despite training only on natural-language text. Code and the trained classifier are available at https://github.com/JetBrains-Research/learned-mia.
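
Two metrics recur throughout: AUC over member/non-member attack scores, and TPR at a fixed 0.1\% FPR, the low-false-positive regime where LT-MIA's 2.8$\times$ gain over the strongest baseline is measured. As a quick illustration of how such numbers are computed from per-sample scores (a generic sketch assuming scikit-learn, not the paper's evaluation code; the synthetic scores are placeholders):

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def mia_metrics(scores, labels, fpr_target=0.001):
    """Score a membership inference attack.

    scores: per-sample attack scores, higher = more likely a training member.
    labels: 1 for members, 0 for non-members.
    Returns (AUC, TPR at the largest threshold whose FPR <= fpr_target).
    """
    auc = roc_auc_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    # roc_curve returns FPR in ascending order; take the last operating
    # point whose FPR does not exceed the target (here 0.1%).
    idx = np.searchsorted(fpr, fpr_target, side="right") - 1
    return auc, tpr[idx]

# Placeholder scores from two overlapping Gaussians, for illustration only.
rng = np.random.default_rng(0)
labels = np.r_[np.ones(1000), np.zeros(1000)]
scores = np.r_[rng.normal(1.0, 1.0, 1000), rng.normal(0.0, 1.0, 1000)]
print(mia_metrics(scores, labels))
```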

Paper Structure

This paper contains 26 sections, 4 figures, and 13 tables.

Figures (4)

  • Figure 1: LT-MIA transfers from transformers to non-transformer architectures, exceeding transformer performance on every text dataset. Blue bars show held-out transformer results (mean over 7 models), while the remaining bars show zero-shot transfer to Mamba-2.8B (state-space), RWKV-4-3B (linear attention), and RecurrentGemma-2B (gated recurrence). The classifier was trained exclusively on transformers and textual data.
  • Figure 2: LT-MIA pipeline. Given a text sample and black-box access to a fine-tuned target model and its pre-trained reference, we extract 154-dimensional feature vectors at each token position capturing how the two models' predictions diverge. A lightweight transformer encoder processes this feature sequence to predict membership. (A minimal sketch of this pipeline follows the list.)
  • Figure 3: Feature importance measured as AUC drop when each feature group is ablated. Comparison features (relating target to reference model outputs) dominate across all four architectural families. The consistent hierarchy (Comparison $>$ Target-only $>$ Reference-only) confirms the membership signal is relational: what matters is how fine-tuning changed the model, not the model's behavior in isolation.
  • Figure 4: Effect of training diversity on generalization, with total samples fixed at 18,000. "Train AUC" is evaluated on held-out samples from training model-dataset combinations (in-distribution), while "Eval AUC" is evaluated on entirely different model-dataset combinations (out-of-distribution). All metrics use samples never seen during classifier training. Training on one combination yields near-perfect in-distribution AUC (0.998) but poor transfer (0.796). With 30 combinations, the gap shrinks from 20.2 points to 0.2 points.
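
The pipeline in Figure 2 extracts per-token features contrasting the fine-tuned target model with its pre-trained reference, then classifies the feature sequence with a lightweight transformer encoder. Below is a minimal, hypothetical rendition of that shape: the three features used here (log-probability gap, entropy gap, log-rank gap) are illustrative stand-ins for the paper's 154-dimensional feature set, and the encoder sizes are arbitrary.

```python
import torch
import torch.nn.functional as F

def per_token_features(text, target, reference, tokenizer):
    """Per-token statistics contrasting a fine-tuned target model with its
    pre-trained reference (both Hugging Face causal LMs). Three illustrative
    features, not the paper's 154-dimensional set."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    stats = []
    with torch.no_grad():
        for model in (target, reference):
            logits = model(ids).logits[0, :-1]   # logits predicting token t+1
            logp = F.log_softmax(logits, dim=-1)
            nxt = ids[0, 1:]                     # tokens actually observed
            tok_logp = logp.gather(-1, nxt[:, None]).squeeze(-1)
            entropy = -(logp.exp() * logp).sum(-1)
            rank = (logp > tok_logp[:, None]).sum(-1).float()
            stats.append((tok_logp, entropy, rank))
    (t_lp, t_h, t_r), (r_lp, r_h, r_r) = stats
    # Comparison features: how fine-tuning shifted the prediction per token.
    return torch.stack(
        [t_lp - r_lp, t_h - r_h, torch.log1p(r_r) - torch.log1p(t_r)], dim=-1
    )  # shape: (seq_len - 1, 3)

class MembershipClassifier(torch.nn.Module):
    """Lightweight transformer encoder over the per-token feature sequence."""
    def __init__(self, n_feats=3, d_model=64):
        super().__init__()
        self.proj = torch.nn.Linear(n_feats, d_model)
        layer = torch.nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = torch.nn.TransformerEncoder(layer, num_layers=2)
        self.head = torch.nn.Linear(d_model, 1)

    def forward(self, feats):             # feats: (batch, seq_len, n_feats)
        h = self.encoder(self.proj(feats))
        return self.head(h.mean(dim=1))   # mean-pool, then membership logit

# Hypothetical wiring with Hugging Face checkpoints (names are placeholders):
# from transformers import AutoModelForCausalLM, AutoTokenizer
# tok = AutoTokenizer.from_pretrained("gpt2")
# target = AutoModelForCausalLM.from_pretrained("path/to/fine-tuned-gpt2")
# reference = AutoModelForCausalLM.from_pretrained("gpt2")
# feats = per_token_features("candidate text ...", target, reference, tok)
# membership_logit = MembershipClassifier()(feats.unsqueeze(0))
```

The comparison-centric choice of features mirrors Figure 3's finding that the membership signal is relational: features relating target to reference outputs carry most of the signal, while either model's behavior in isolation carries less.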