Data Contamination Calibration for Black-box LLMs

Wentao Ye, Jiaqi Hu, Liyao Li, Haobo Wang, Gang Chen, Junbo Zhao

TL;DR

This work tackles the data contamination problem in large-language-model pretraining by extending Membership Inference Attacks into a practical, black-box detector called Polarized Augment Calibration (PAC). PAC uses adjacent-sample augmentation and a novel polarized distance to detect overfitted, memorized training samples without relying on external proxy models, and it includes a probabilistic tracking method for API-based LLMs. The authors introduce StackMIA, a dynamic benchmark for timely contamination evaluation, and demonstrate that PAC consistently outperforms baselines (average gains of 4.5% on WikiMIA and 5.9% on StackMIAsub in AUC) across ten models and multiple data formats, including synonym-rewritten data. Case studies on real-world models (e.g., GPT-3, ChatGPT, GPT-4) highlight substantial contamination and ethical-bias risks, underscoring the need for safer data curation and model deployment practices.
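The gains above are reported in AUC. As a reminder of what that metric measures for membership detection, here is a minimal sketch of the standard pairwise definition of ROC AUC: the probability that a randomly chosen member receives a higher detector score than a randomly chosen non-member, with ties counted as one half. The function name `auc` and the example scores are illustrative assumptions, not taken from the paper, and the source does not say how the authors compute the metric.

```python
from typing import Sequence


def auc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """Pairwise ROC AUC: fraction of (member, non-member) pairs ranked correctly.

    Ties count as half a correct pair. Runs in O(n * m), which is fine for a sketch.
    """
    if not member_scores or not nonmember_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Made-up detector scores, for illustration only.
example = auc([0.9, 0.3], [0.5, 0.1])
```

An AUC of 0.5 corresponds to a detector that cannot separate members from non-members, and 1.0 to perfect separation.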

Abstract

The rapid advancement of Large Language Models (LLMs) is tightly associated with the expansion of training data size. However, unchecked ultra-large-scale training sets introduce a series of potential risks such as data contamination, i.e., benchmark data being used for training. In this work, we propose a holistic method named Polarized Augment Calibration (PAC), along with a new to-be-released dataset, to detect contaminated data and diminish the contamination effect. PAC extends the popular Membership Inference Attack (MIA) from the machine learning community by forming a more global target for detecting training data, in order to clarify invisible training data. As a pioneering work, PAC is plug-and-play and can be integrated with most (if not all) current white- and black-box LLMs. In extensive experiments, PAC outperforms existing methods by at least 4.5% on data contamination detection across more than 4 dataset formats and more than 10 base LLMs. In addition, our application in real-world scenarios highlights the prominent presence of contamination and related issues.

Paper Structure (36 sections, 18 equations, 8 figures, 7 tables, 1 algorithm)

Figures (8)

  • Figure 1: We focus on determining whether a given sample is contained in the training set of the LLMs. For a candidate $z$, PAC utilizes random swap augmentation to generate adjacent samples in local distribution regions. Consequently, PAC compares the polarized distance of $z$ with its adjacent samples $\tilde{z}$, where the polarized distance is a spatial measurement jointly considering far and near probability regions.
  • Figure 2: Histogram of model confidence (measured by perplexity, following the loss attack) before and after PAC for GPT-3 (davinci-002) on the WikiMIA dataset (Shi et al., 2023), where PAC significantly enhances the salience of the differences between members and non-members.
  • Figure 3: The AUC results as four different factors vary.
  • Figure 4: The AUC results in two-stage detection. 'whole' and 'output' denote two settings: detecting with the whole sample and detecting with only the output part.
  • Figure 5: Bias data contamination cases of GPT-3 models. Cases are randomly selected from the TOXIGEN dataset.
  • ...and 3 more figures
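
The Figure 1 caption describes the PAC pipeline at a high level: generate adjacent samples by random swap, then compare the polarized distance of the candidate with that of its adjacent samples. The sketch below is one possible reading of that description, not the authors' implementation. In particular, the exact form of the polarized distance is not given in this section, so the code assumes it is the mean of the highest token log-probabilities minus the mean of the lowest ones. The names `LogProbFn`, `random_swap`, `polarized_distance`, `pac_score`, and `toy_logprobs`, as well as all default parameter values, are hypothetical.

```python
import random
from statistics import mean
from typing import Callable, List, Sequence

# Hypothetical scoring interface: one log-probability per token of the input.
LogProbFn = Callable[[Sequence[str]], List[float]]


def random_swap(tokens: Sequence[str], n_swaps: int, rng: random.Random) -> List[str]:
    """Return a copy of `tokens` with `n_swaps` random pairs of positions exchanged."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def polarized_distance(logprobs: Sequence[float], k_high: int, k_low: int) -> float:
    """ASSUMED form: mean of the k_high largest log-probs minus mean of the k_low smallest."""
    if not logprobs:
        raise ValueError("logprobs must be non-empty")
    ordered = sorted(logprobs)
    k_high = max(1, min(k_high, len(ordered)))
    k_low = max(1, min(k_low, len(ordered)))
    return mean(ordered[-k_high:]) - mean(ordered[:k_low])


def pac_score(
    tokens: Sequence[str],
    logprob_fn: LogProbFn,
    n_adjacent: int = 5,
    n_swaps: int = 1,
    k_high: int = 2,
    k_low: int = 2,
    seed: int = 0,
) -> float:
    """Polarized distance of the candidate minus the mean over its adjacent samples."""
    rng = random.Random(seed)
    own = polarized_distance(logprob_fn(tokens), k_high, k_low)
    adjacent = [
        polarized_distance(logprob_fn(random_swap(tokens, n_swaps, rng)), k_high, k_low)
        for _ in range(n_adjacent)
    ]
    return own - mean(adjacent)


def toy_logprobs(tokens: Sequence[str]) -> List[float]:
    """Stand-in for a real LLM: pretends one four-word sentence was memorized in order."""
    memorized = ["the", "cat", "sat", "down"]
    return [
        -0.1 if i < len(memorized) and tok == memorized[i] else -5.0
        for i, tok in enumerate(tokens)
    ]
```

Calling `pac_score(["the", "cat", "sat", "down"], toy_logprobs)` calibrates the toy "memorized" sentence against five swapped neighbours. With a real model, `logprob_fn` would wrap whatever token-probability access is available; the sign convention and the decision threshold are not specified here and would have to be taken from the paper itself.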