Pretraining Data Detection for Large Language Models: A Divergence-based Calibration Method

Weichao Zhang, Ruqing Zhang, Jiafeng Guo, Maarten de Rijke, Yixing Fan, Xueqi Cheng

TL;DR

This paper addresses the challenge of detecting whether a text was part of an LLM’s pretraining data under a black-box setting. It proposes DC-PDD, a divergence-from-randomness–inspired calibration that compares the token probability distribution of a text to a reference token frequency distribution via cross-entropy, producing a calibrated detection score. Across English and a new Chinese benchmark (PatentMIA), DC-PDD outperforms strong baselines like Min-K% Prob and Min-K%++ Prob in AUC and TPR@5%FPR, with robust performance across model sizes and text lengths. The work advances transparent evaluation of pretraining data exposure and provides open-source benchmarks and code for reproducibility, while acknowledging ethical and methodological limitations and proposing future enhancements such as corpus-level detection and larger-scale validation.
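
The two metrics named above can be illustrated with a short, hedged sketch. It assumes that a higher detection score means "more likely to be training data"; the function names `auc` and `tpr_at_fpr` and the toy scores are illustrative and are not taken from the paper's code.

```python
def auc(member_scores, nonmember_scores):
    """Probability that a random training text outscores a random
    non-training text (ties count as half a win)."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def tpr_at_fpr(member_scores, nonmember_scores, max_fpr=0.05):
    """Best true positive rate among thresholds whose false positive
    rate does not exceed max_fpr."""
    best = 0.0
    for threshold in sorted(set(member_scores) | set(nonmember_scores)):
        # A text is flagged as training data when its score >= threshold.
        fpr = sum(s >= threshold for s in nonmember_scores) / len(nonmember_scores)
        if fpr <= max_fpr:
            tpr = sum(s >= threshold for s in member_scores) / len(member_scores)
            best = max(best, tpr)
    return best


# Toy scores, invented purely for illustration.
members = [0.9, 0.8, 0.7, 0.4]
nonmembers = [0.6, 0.5, 0.3, 0.2]
print(auc(members, nonmembers), tpr_at_fpr(members, nonmembers))
```

TPR@5%FPR is the stricter of the two: it asks how many training texts a detector recovers when it is allowed to flag at most 5% of non-training texts by mistake.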

Abstract

As the scale of training corpora for large language models (LLMs) grows, model developers become increasingly reluctant to disclose details on their data. This lack of transparency poses challenges to scientific evaluation and ethical deployment. Recently, pretraining data detection approaches, which infer whether a given text was part of an LLM's training data through black-box access, have been explored. The Min-K% Prob method, which has achieved state-of-the-art results, assumes that a non-training example tends to contain a few outlier words with low token probabilities. However, its effectiveness may be limited, as it tends to misclassify non-training texts that contain many common words to which LLMs assign high probabilities. To address this issue, we introduce a divergence-based calibration method, inspired by the divergence-from-randomness concept, to calibrate token probabilities for pretraining data detection. We compute the cross-entropy (i.e., the divergence) between the token probability distribution and the token frequency distribution to derive a detection score. We have developed a Chinese-language benchmark, PatentMIA, to assess the performance of detection approaches for LLMs on Chinese text. Experimental results on English-language benchmarks and PatentMIA demonstrate that our proposed method significantly outperforms existing methods. Our code and PatentMIA benchmark are available at https://github.com/zhang-wei-chao/DC-PDD.
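
The calibration idea in the abstract can be sketched as follows. This is a minimal illustration built only from the abstract's description (a cross-entropy between model token probabilities and reference token frequencies), not the authors' implementation; the released code may apply further steps. The function names, the Laplace smoothing, and the toy corpus are assumptions made for this example.

```python
import math
from collections import Counter


def token_frequency_distribution(reference_tokens, vocab_size):
    """Return a function giving the smoothed frequency of a token in a
    reference corpus. Laplace (add-one) smoothing is an assumption here,
    used so that unseen tokens do not get zero frequency."""
    counts = Counter(reference_tokens)
    total = len(reference_tokens)

    def freq(token):
        return (counts[token] + 1) / (total + vocab_size)

    return freq


def calibrated_score(tokens, token_probs, freq):
    """Mean of -p_model(x_i) * log p_freq(x_i) over the tokens of a text.

    token_probs holds the probability the LLM assigned to each token
    given its prefix; in practice it would come from the model's output."""
    terms = [-p * math.log(freq(t)) for t, p in zip(tokens, token_probs)]
    return sum(terms) / len(terms)


# Toy reference corpus: "the" is common, "zymurgy" never appears.
freq = token_frequency_distribution(["the"] * 8 + ["patent"] * 2, vocab_size=10)
common = calibrated_score(["the"], [0.9], freq)
rare = calibrated_score(["zymurgy"], [0.9], freq)
print(common < rare)
```

With the same model probability, a common token contributes much less to the score than a rare one. This is the effect the abstract targets: high probabilities on common words carry little evidence about training membership.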

Paper Structure (20 sections, 9 equations, 3 figures, 5 tables, 1 algorithm)


Figures (3)

  • Figure 1: A conceptual example: Let $x^1$ represent a non-training text and $x^2$ a training text. (a) Min-K% Prob directly selects the $k$% of tokens with the lowest probabilities for detection. (b) DC-PDD computes the divergence between the token probability distribution and the token frequency distribution for detection.
  • Figure 2: Ablation studies of DC-PDD.
  • Figure 3: The performance of DC-PDD w.r.t. model size and text length.
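
For contrast with panel (a) of Figure 1, the baseline's selection step can be sketched as below. This follows the caption's description (take the $k$% of tokens with the lowest probabilities); averaging their log-probabilities is the commonly described form of Min-K% Prob and is stated here as an assumption. The function name `min_k_prob` and the toy probabilities are illustrative.

```python
import math


def min_k_prob(token_probs, k_percent=20):
    """Average log-probability of the k% lowest-probability tokens.

    token_probs holds the probability the LLM assigned to each token
    of the text given its prefix."""
    # Keep at least one token so short texts still get a score.
    n = max(1, len(token_probs) * k_percent // 100)
    lowest = sorted(token_probs)[:n]
    return sum(math.log(p) for p in lowest) / n


# One low-probability outlier dominates the score for this toy text.
print(min_k_prob([0.9, 0.8, 0.01, 0.7, 0.6], k_percent=20))
```

Because the score depends only on raw token probabilities, a non-training text made up of common, highly predictable words has no low-probability outliers and can look like training data, which is the failure mode the abstract describes.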