Adaptive Pre-training Data Detection for Large Language Models via Surprising Tokens

Anqi Zhang, Chaofeng Wu

TL;DR

The paper tackles pretraining data detection for LLMs under black-box access, challenging the reliance on verbatim memorization in vast, near-one-epoch training regimes. It introduces SURP, an adaptive method that locates surprising tokens using low entropy and low ground-truth probability, and computes an average log-probability over these tokens to assess whether text was seen during pretraining. SURP demonstrates superior AUC-ROC performance compared with baselines across WikiMIA, MIMIR, and the Dolma-Book benchmark, including robustness to deduplicated data and varying input lengths, with gains up to 29.5% in some cases. The work also provides Dolma-Book, a new benchmark built on OLMo, and argues for detecting pretraining data using strategies beyond verbatim memorization to improve privacy, security, and copyright protections.

Abstract

While large language models (LLMs) are extensively used, there are rising concerns regarding privacy, security, and copyright due to their opaque training data, which brings the problem of detecting pre-training data to the table. Current solutions to this problem leverage techniques explored in machine learning privacy such as Membership Inference Attacks (MIAs), which heavily depend on LLMs' capability of verbatim memorization. However, this reliance presents challenges, especially given the vast amount of training data and the restricted number of effective training epochs. In this paper, we propose an adaptive pre-training data detection method which alleviates this reliance and effectively amplifies the identification. Our method adaptively locates surprising tokens of the input. A token is surprising to an LLM if the prediction on the token is "certain but wrong", which refers to low Shannon entropy of the probability distribution and low probability of the ground truth token at the same time. By using the prediction probability of surprising tokens to measure how surprising an input is, the detection method rests on the simple hypothesis that seen data is less surprising to the model than unseen data. The method can be applied without any access to the pre-training data corpus or additional training such as reference models. Our approach exhibits a consistent enhancement compared to existing methods in diverse experiments conducted on various benchmarks and models, achieving a maximum improvement of 29.5%. We also introduce a new benchmark, Dolma-Book, developed upon a novel framework, which employs book data collected both before and after model training to provide further evaluation.
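
The scoring idea described in the abstract can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `surp_score`, the parameters `max_entropy` and `low_prob_percentile`, their default values, and the fallback for inputs with no surprising tokens are all assumptions made for this example. It takes per-position next-token probability distributions (as a black-box model might return them), keeps positions that are both low-entropy and low in ground-truth probability, and averages the ground-truth log-probability over them.

```python
import numpy as np


def surp_score(probs, target_ids, max_entropy=2.0, low_prob_percentile=20.0):
    """Average ground-truth log-probability over 'surprising' positions.

    probs:      array of shape (seq_len, vocab_size), each row a distribution.
    target_ids: array of shape (seq_len,), the ground-truth token at each position.
    max_entropy, low_prob_percentile: hypothetical thresholds (assumptions).
    """
    probs = np.asarray(probs, dtype=float)
    target_ids = np.asarray(target_ids)
    eps = 1e-12  # guards against log(0)

    # Shannon entropy of the predicted distribution at each position.
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)

    # Log-probability the model assigns to the ground-truth token.
    gt_logp = np.log(probs[np.arange(len(target_ids)), target_ids] + eps)

    # "Low" ground-truth probability is defined relative to this input's own
    # values, which makes the selection adaptive per input (an assumption).
    cutoff = np.percentile(gt_logp, low_prob_percentile)

    # Surprising = confident prediction (low entropy) that misses the truth.
    surprising = (entropy < max_entropy) & (gt_logp <= cutoff)

    if not surprising.any():
        # Fallback chosen for this sketch: average over all positions.
        return float(gt_logp.mean())
    return float(gt_logp[surprising].mean())


# Toy input: position 0 is "certain but wrong", position 1 is uncertain,
# position 2 is certain and right.
toy_probs = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.97, 0.01, 0.01, 0.01],
]
toy_targets = [1, 0, 0]
toy_score = surp_score(toy_probs, toy_targets, max_entropy=1.0, low_prob_percentile=50.0)
```

Under the abstract's hypothesis, a higher (less negative) score suggests the input is less surprising to the model and therefore more likely to have been seen; the decision threshold on the score would have to be chosen separately.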

Paper Structure (24 sections, 4 equations, 4 figures, 3 tables)

This paper contains 24 sections, 4 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of SURP. The left part illustrates a surprising token (bottom left) and an unsurprising token (upper left). The x-axis represents words in the vocabulary (token candidates), and the y-axis represents the probability that the model assigns to each word. A surprising token should satisfy both (a) the probability distribution is not flat (low entropy) and (b) the probability of the ground truth token (GT prob in figure) is relatively low. The right part shows the flow of SURP. Given an input, we can get the entropy and ground truth probability at each index. Then we use the average ground truth probability of surprising tokens as the score of the input, to determine whether it is seen or not.
  • Figure 2: The calculated token entropy values and the model's prediction log-probability on ground truth tokens for (a) all input tokens and (b) tokens that have low entropy and low ground truth probability, using GPT-Neo-2.7B on the DM Math dataset.
  • Figure 3: The ROC curve for LLaMA-13B on WikiMIA length-$64$ dataset.
  • Figure 4: Heatmap to show the AUC scores of different hyperparameters, using OLMo-7B on Dolma-Book-middle dataset.
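
Figures 3 and 4 report ROC curves and AUC scores. As a minimal sketch of how an AUC value can be obtained from per-input detection scores, the function below uses the pairwise (Mann-Whitney) formulation: the fraction of seen/unseen pairs in which the seen input receives the higher score, with ties counted as half. The function name `auc_roc` and the toy scores are assumptions for illustration, not values from the paper.

```python
def auc_roc(member_scores, nonmember_scores):
    """AUC as the probability that a seen input outscores an unseen one."""
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0   # seen input ranked above unseen input
            elif m == n:
                wins += 0.5   # ties count as half
    return wins / (len(member_scores) * len(nonmember_scores))


# Toy scores (higher = less surprising = more likely seen).
perfect = auc_roc([-1.0, -2.0], [-3.0, -4.0])
chance = auc_roc([-1.0, -4.0], [-2.0, -3.0])
```

An AUC of 1.0 means every seen input scores above every unseen input, while 0.5 corresponds to chance-level separation.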