From Unfamiliar to Familiar: Detecting Pre-training Data via Gradient Deviations in Large Language Models

Ruiqi Zhang, Lingxiang Wang, Hainan Zhang, Zhiming Zheng, Yanyan Lan

TL;DR

GDS identifies pre-training data by probing the Gradient Deviation Scores of target samples. It represents each sample with a gradient profile that captures the magnitude, location, and concentration of parameter updates across FFN and Attention modules, which reveals consistent distinctions between member and non-member data.

Abstract

Pre-training data detection for LLMs is essential for addressing copyright concerns and mitigating benchmark contamination. Existing methods mainly rely on either likelihood-based statistical features or heuristic signals measured before and after fine-tuning. The former are susceptible to word frequency bias in corpora, and the latter depend strongly on the similarity of the fine-tuning data. From an optimization perspective, we observe that during training, samples transition from unfamiliar to familiar in a manner reflected by systematic differences in gradient behavior. Familiar samples exhibit smaller update magnitudes, distinct update locations in model components, and more sharply activated neurons. Based on this insight, we propose GDS, a method that identifies pre-training data by probing Gradient Deviation Scores of target samples. Specifically, we first represent each sample using gradient profiles that capture the magnitude, location, and concentration of parameter updates across FFN and Attention modules, revealing consistent distinctions between member and non-member data. These features are then fed into a lightweight classifier to perform binary membership inference. Experiments on five public datasets show that GDS achieves state-of-the-art performance with significantly improved cross-dataset transferability over strong baselines. Further interpretability analyses reveal differences in the gradient feature distributions of member and non-member data, enabling practical and scalable pre-training data detection.
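The abstract names three properties of a gradient profile (magnitude, location, concentration) but does not give their formulas here. As a rough illustration only, the sketch below computes one plausible stand-in for each property from a per-module gradient matrix in plain Python. The function names `gradient_profile` and `sample_features`, the `top_frac` parameter, and all three feature definitions are assumptions made for this example; they are not the paper's definitions.

```python
import math


def gradient_profile(grad, top_frac=0.1, eps=1e-12):
    """Summarize one module's gradient matrix (a list of rows).

    All three features are illustrative stand-ins, not the paper's formulas.
    """
    flat = [abs(g) for row in grad for g in row]
    total = sum(flat)

    # Magnitude (assumed): mean absolute gradient entry.
    magnitude = total / len(flat)

    # Location (assumed): fraction of rows whose absolute mass
    # is strictly above the mean row mass.
    row_mass = [sum(abs(g) for g in row) for row in grad]
    mean_row = total / len(grad)
    location = sum(1 for m in row_mass if m > mean_row) / len(grad)

    # Concentration (assumed): share of total absolute mass held by
    # the largest `top_frac` of entries.
    k = max(1, math.ceil(top_frac * len(flat)))
    concentration = sum(sorted(flat, reverse=True)[:k]) / (total + eps)

    return {
        "magnitude": magnitude,
        "location": location,
        "concentration": concentration,
    }


def sample_features(grads_by_module):
    """Concatenate per-module profiles into one flat feature vector.

    Modules are visited in sorted name order so the layout is stable.
    """
    features = []
    for name in sorted(grads_by_module):
        p = gradient_profile(grads_by_module[name])
        features.extend([p["magnitude"], p["location"], p["concentration"]])
    return features


# Toy gradients: one peaked matrix and one uniform matrix (made-up values).
peaked = [[0.0, 0.0], [0.0, 4.0]]
uniform = [[1.0, 1.0], [1.0, 1.0]]
vector = sample_features({"ffn": peaked, "attn": uniform})
```

In a real pipeline the matrices would come from backpropagating a sample's loss through the target model, and vectors like `vector` would be passed to a lightweight binary classifier, as the abstract describes.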

Paper Structure

This paper contains 39 sections, 22 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Heat maps of gradient matrices from the unfamiliar stage (epochs 0–1, panels a, c) to the familiar stage (epochs 6–7, panels b, d). Panels a and b display the update magnitude of the FFN module in layer 30, while panels c and d illustrate the update positions, with white for sparse and blue for core updates.
  • Figure 2: Dynamic curves of the four parameter update characteristics defined in the Motivation section. $i \to j$ denotes the epoch interval. The y-axis shows the value of each characteristic, with orange and blue representing the Attention and FFN modules, respectively.
  • Figure 3: Overview of the Gradient Deviation Scores method.
  • Figure 4: Distribution differences of eight gradient features between member (red) and non-member (blue) samples. Dashed lines indicate distribution means. The x-axis represents feature values, and the y-axis denotes probability density.
  • Figure 5: Distribution Difference Statistics of Discriminative Features. The y-axis shows the statistical count.
  • ...and 4 more figures