Fine-tuning can Help Detect Pretraining Data from Large Language Models

Hengxiang Zhang, Songxin Zhang, Bingyi Jing, Hongxin Wei

TL;DR

This work tackles the challenge of identifying whether a text sample was included in an LLM's pretraining data, a problem with important fairness and copyright implications. It shows that traditional scoring functions like perplexity and Min-k% struggle due to the diversity of pretraining data, and introduces Fine-tuned Score Deviation (FSD), which leverages a brief fine-tuning step on a small set of unseen-domain texts to magnify the score gap between seen and unseen data. The authors demonstrate, across multiple benchmarks (WikiMIA, ArXivTection, BookMIA, BookTection, Pile) and models (including LLaMA variants and GPT-based architectures), that FSD consistently enhances detection performance, achieving substantial AUC gains even with limited fine-tuning data and across different fine-tuning methods (LoRA, AdaLoRA, IA3). They further discuss robustness to model size, domain shifts, and ethical considerations, making FSD a practical, model-agnostic enhancement for pretraining data detection in real-world settings.

Abstract

In the era of large language models (LLMs), detecting pretraining data has become increasingly important due to concerns about fair evaluation and ethical risks. Current methods differentiate members from non-members by designing scoring functions, such as Perplexity and Min-k%. However, the diversity and complexity of training data make this distinction harder, leading to suboptimal detection performance. In this paper, we first explore the benefits of unseen data, which can be easily collected after the release of the LLM. We find that, after fine-tuning with a small amount of previously unseen data, the perplexities of LLMs shift differently for members and non-members. In light of this, we introduce a novel and effective method termed Fine-tuned Score Deviation (FSD), which improves the performance of current scoring functions for pretraining data detection. In particular, we propose to measure the deviation distance of current scores after fine-tuning on a small amount of unseen data within the same domain. In effect, fine-tuning on a small amount of unseen data can substantially decrease the scores of all non-members, leading to a larger deviation distance for non-members than for members. Extensive experiments demonstrate the effectiveness of our method, which significantly improves the AUC score on common benchmark datasets across various models.
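The deviation idea described in the abstract can be illustrated with a minimal Python sketch. It assumes the per-token log-probabilities of a text under the pre-trained and the fine-tuned model are already available; the function names (`perplexity`, `neg_min_k`, `fsd`, `predict_member`), the toy numbers, and the threshold are all hypothetical and not taken from the paper. The Min-k% variant is negated here, as an assumption, so that lower values mean "more member-like" for both scores.

```python
import numpy as np

def perplexity(token_logprobs):
    """Perplexity from per-token natural-log probabilities (lower = more member-like)."""
    lp = np.asarray(token_logprobs, dtype=float)
    return float(np.exp(-lp.mean()))

def neg_min_k(token_logprobs, k=0.2):
    """Negated mean log-prob of the k fraction of least likely tokens.
    The sign flip is an assumption of this sketch so that lower = more member-like."""
    lp = np.sort(np.asarray(token_logprobs, dtype=float))
    n = max(1, int(np.ceil(k * lp.size)))
    return float(-lp[:n].mean())

def fsd(score_fn, logprobs_pretrained, logprobs_finetuned):
    """Deviation: score under the pre-trained model minus score under the fine-tuned model."""
    return score_fn(logprobs_pretrained) - score_fn(logprobs_finetuned)

def predict_member(deviation, threshold):
    """A deviation below the threshold suggests the text was in the pretraining data."""
    return deviation < threshold

# Toy log-probabilities (made up): fine-tuning on unseen data barely changes
# the member's scores but sharply improves the non-member's.
member_pre, member_ft = [-1.0, -1.2, -0.8], [-0.95, -1.1, -0.8]
nonmember_pre, nonmember_ft = [-3.0, -2.5, -3.5], [-1.5, -1.0, -2.0]

dev_member = fsd(perplexity, member_pre, member_ft)
dev_nonmember = fsd(perplexity, nonmember_pre, nonmember_ft)
threshold = 1.0  # hypothetical; in practice it would be tuned on held-out data
```

With these toy values the non-member's deviation is much larger than the member's, so `predict_member(dev_member, threshold)` holds while `predict_member(dev_nonmember, threshold)` does not. Passing `neg_min_k` instead of `perplexity` to `fsd` applies the same deviation to a Min-k%-style score.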

Paper Structure

This paper contains 50 sections, 5 equations, 6 figures, 17 tables.

Figures (6)

  • Figure 1: Overview of Fine-tuned Score Deviation. Our method first fine-tunes the pre-trained model with a few non-members and then measures the deviation distance of scores from the pre-trained and fine-tuned models as a membership inference metric. If the deviation value is smaller than the threshold value, the text $X$ is likely in the pretraining data.
  • Figure 2: The perplexity distribution from the pre-trained model and the fine-tuned model.
  • Figure 3: Distribution of scores from pre-trained model vs. FSD. We contrast the distribution of scores from the pre-trained model using perplexity and our FSD with perplexity (a & c). Similarly, we contrast the Min-k% score distribution from the pre-trained model and our FSD (b & d). Using FSD enlarges the gap between members and non-members.
  • Figure 4: AUC and TPR@5%FPR of scoring functions with FSD, using auxiliary datasets with varying sizes. Notably, $\bigstar$ represents the baseline without FSD.
  • Figure 5: Distribution of scores from pre-trained model vs. FSD. We contrast the score distribution from the pre-trained model using perplexity and our FSD with perplexity (a & c). Similarly, we contrast the Min-k% score distribution from the pre-trained model and our FSD (b & d). Using FSD enlarges the gap between members and non-members.
  • ...and 1 more figures
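
The two metrics named in the Figure 4 caption, AUC and TPR@5%FPR, can be computed from detection scores as in the sketch below. This is an illustrative NumPy implementation, not the authors' evaluation code; the helper names and the toy deviation values are hypothetical. Deviations are negated so that a higher detection score means "more likely a member".

```python
import numpy as np

def auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member (ties count half)."""
    m = np.asarray(member_scores, dtype=float)[:, None]
    n = np.asarray(nonmember_scores, dtype=float)[None, :]
    return float(((m > n) + 0.5 * (m == n)).mean())

def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.05):
    """Highest true positive rate over thresholds whose false positive rate is at most max_fpr."""
    m = np.asarray(member_scores, dtype=float)
    n = np.asarray(nonmember_scores, dtype=float)
    best = 0.0
    for t in np.unique(np.concatenate([m, n])):
        fpr = (n >= t).mean()  # non-members wrongly flagged as members
        if fpr <= max_fpr:
            best = max(best, float((m >= t).mean()))
    return best

# Toy deviation values (made up): members tend to have small deviations.
member_dev = np.array([0.1, 0.2, 0.3, 2.0])
nonmember_dev = np.array([1.5, 3.0, 4.0, 5.0])

# Negate so that higher score = more member-like.
auc_value = auc(-member_dev, -nonmember_dev)
tpr_value = tpr_at_fpr(-member_dev, -nonmember_dev)
```

With only four non-members in the toy data, an FPR of at most 5% means no false positives at all, so the threshold must sit above every non-member score.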