Evading Data Contamination Detection for Language Models is (too) Easy

Jasper Dekoninck, Mark Niklas Müller, Maximilian Baader, Marc Fischer, Martin Vechev

TL;DR

This work reveals that public benchmarks for language models are highly vulnerable to data contamination and, more critically, to intentional evasion by malicious providers. By formalizing actor archetypes and detector assumptions, the authors show that current contamination detection methods fail under evasive transformations, particularly paraphrase-based rephrasing implemented via Evasive Augmentation Learning (EAL). Through extensive experiments across GSM8K, TruthfulQA, ARC, and MMLU with GPT-2 XL and Mistral 7b, they demonstrate substantial performance gains from contaminated data, while many detectors cannot reliably identify the contamination. The study underscores the risk to benchmark integrity and advocates dynamic, human-involved, or private evaluation strategies, as well as the development of more robust detection and decontamination techniques to preserve trustworthy benchmarking in the era of large language models.

Abstract

Large language models are widespread, with their performance on benchmarks frequently guiding user preferences for one model over another. However, the vast amount of data these models are trained on can inadvertently lead to contamination with public benchmarks, thus compromising performance measurements. While recently developed contamination detection methods try to address this issue, they overlook the possibility of deliberate contamination by malicious model providers aiming to evade detection. We argue that this setting is of crucial importance as it casts doubt on the reliability of public benchmarks. To more rigorously study this issue, we propose a categorization of both model providers and contamination detection methods. This reveals vulnerabilities in existing methods that we exploit with EAL, a simple yet effective contamination technique that significantly inflates benchmark performance while completely evading current detection methods.
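To make the evasion idea concrete, the following is a minimal sketch of why rephrased benchmark data can slip past a surface-level overlap check. Everything in it is an assumption made for illustration: the word-trigram overlap detector is a toy stand-in rather than one of the detection methods studied in the paper, the sample question and its paraphrase are hypothetical, and the helper names `ngrams` and `overlap_ratio` are invented here. It does not implement EAL itself.

```python
# Toy illustration (assumed setup, not the paper's detectors or EAL):
# a word-trigram overlap check flags verbatim copies of a benchmark
# sample but misses a paraphrase that preserves the meaning.

def ngrams(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of lowercase whitespace-token n-grams in `text`."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_ratio(candidate: str, benchmark: str, n: int = 3) -> float:
    """Fraction of the benchmark sample's n-grams that also occur in `candidate`."""
    bench = ngrams(benchmark, n)
    if not bench:
        return 0.0
    return len(bench & ngrams(candidate, n)) / len(bench)


# Hypothetical benchmark sample and two candidate training samples.
benchmark = "A farmer has 12 cows and buys 7 more. How many cows does the farmer have now?"
verbatim = benchmark
rephrased = (
    "After purchasing 7 additional cows, what is the total herd size "
    "of a farmer who started with 12?"
)

print("verbatim :", overlap_ratio(verbatim, benchmark))
print("rephrased:", overlap_ratio(rephrased, benchmark))
```

The verbatim copy shares every trigram with the benchmark sample, while the paraphrase shares none, even though both leak the same question into the training data. A detector that relies only on such surface overlap would therefore report the rephrased sample as clean.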

Paper Structure

This paper contains 55 sections, 4 figures, and 13 tables.

Figures (4)

  • Figure 1: Evading contamination detection can be done very effectively.
  • Figure 2: Overview of four archetypes for model training. Malicious, honest-but-negligent, and proactive actors perform different data preprocessing. Evasively malicious actors perform additional steps to avoid contamination detection. This allows the malicious actor to get the best clean performance. Attribution is given in the appendix.
  • Figure 3: System prompts used for rephrasing.
  • Figure 4: User prompts used for further rephrasing of each benchmark.

Theorems & Definitions (4)

  • Definition 1: Sample-level Data Contamination
  • Definition 2: Benchmark-level Data Contamination
  • Definition 3: Contamination
  • Definition 4: Evasiveness