
Data Contamination Can Cross Language Barriers

Feng Yao, Yufan Zhuang, Zihao Sun, Sunan Xu, Animesh Kumar, Jingbo Shang

TL;DR

This paper presents a cross-lingual form of contamination that inflates LLMs’ performance while evading current detection methods. The contamination is deliberately injected by overfitting LLMs on translated versions of benchmark test sets. The paper also proposes generalization-based approaches to unmask such deeply concealed contamination.

Abstract

The opacity of large language model (LLM) development is raising growing concerns about the potential contamination of public benchmarks in the pre-training data. Existing contamination detection methods are typically based on the text overlap between training and evaluation data, which can be too superficial to reflect deeper forms of contamination. In this paper, we first present a cross-lingual form of contamination that inflates LLMs' performance while evading current detection methods. It is deliberately injected by overfitting LLMs on translated versions of benchmark test sets. We then propose generalization-based approaches to unmask such deeply concealed contamination. Specifically, we examine how an LLM's performance changes after the original benchmark is modified by replacing the false answer choices with correct ones from other questions. Contaminated models can hardly generalize to such easier situations, where the false choices may be "not even wrong", since every choice is a correct answer in the model's memorization. Experimental results demonstrate that cross-lingual contamination can easily fool existing detection methods, but not ours. In addition, we discuss the potential use of cross-lingual contamination in interpreting LLMs' working mechanisms and in post-training LLMs for enhanced multilingual capabilities. The code and dataset we use can be obtained from https://github.com/ShangDataLab/Deep-Contam.
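To illustrate why detection based on text overlap can miss translated test data, the following is a minimal sketch of a word n-gram overlap check. It is a generic stand-in and not one of the specific detection methods evaluated in the paper; the function names, the trigram setting, and the example question with its Spanish rendering are all hypothetical.

```python
def ngrams(text, n=3):
    """Return the set of lowercase word n-grams in `text`."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(train_text, test_text, n=3):
    """Fraction of the test text's n-grams that also occur in the training text."""
    test_grams = ngrams(test_text, n)
    if not test_grams:
        return 0.0
    return len(test_grams & ngrams(train_text, n)) / len(test_grams)


# Hypothetical benchmark item (not taken from MMLU).
test_item = "What is the capital of France? A. Paris B. Rome C. Vienna D. Berlin"

# Hypothetical training snippets: a verbatim leak and a translated leak.
vanilla_leak = "Question: " + test_item
spanish_leak = "¿Cuál es la capital de Francia? A. París B. Roma C. Viena D. Berlín"

vanilla_score = overlap_ratio(vanilla_leak, test_item)
cross_lingual_score = overlap_ratio(spanish_leak, test_item)

print(f"vanilla overlap:       {vanilla_score:.2f}")
print(f"cross-lingual overlap: {cross_lingual_score:.2f}")
```

In this toy case the verbatim copy shares every trigram with the test item, while the translated copy shares none, even though both snippets carry the same question and answer choices.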

Paper Structure (32 sections, 5 figures, 5 tables)


Figures (5)

  • Figure 1: A comparison between injecting vanilla and cross-lingual contamination of the MMLU dataset by pre-training LLMs to memorize text. Existing text-overlap-based methods can detect only the vanilla contamination, not the cross-lingual one. Here, the translation can be performed in various languages beyond Spanish.
  • Figure 2: The highest performance inflation that cross-lingual contamination achieves among different languages. Results for all languages are shown in \ref{sec:inject_perform}.
  • Figure 3: Pipeline to construct the pre-training corpus for the causal language modeling objective, where the loss is calculated at each token to memorize the benchmark.
  • Figure 4: An illustration of the construction process of the generalized benchmark, where each question's new incorrect choices are sampled from the correct ones for other questions (marked in blue shadow). The correct choices (marked in bold) are then randomly shuffled together with the newly sampled incorrect ones.
  • Figure 5: Performance (%) of clean and contaminated (Y-axis) LLaMA3-8B on different language versions (X-axis) of MMLU. Here, the first row "raw" represents the clean model's performance. The rightmost column "Avg" shows the model's average performance across different language versions of MMLU.
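
The construction described in the Figure 4 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `generalize` function, the dictionary schema (`question`, `answer`, `choices`, `label`), the choice count of four, and the toy questions are all hypothetical.

```python
import random


def generalize(questions, num_choices=4, seed=0):
    """Rebuild each item so its wrong choices are correct answers of other items.

    `questions` is a list of dicts with keys "question" and "answer" (the text
    of the correct choice). This schema is a hypothetical one for illustration.
    """
    rng = random.Random(seed)
    generalized = []
    for i, item in enumerate(questions):
        # Candidate distractors: correct answers of all other questions,
        # skipping any that coincide with this item's own correct answer.
        pool = sorted({q["answer"] for j, q in enumerate(questions)
                       if j != i and q["answer"] != item["answer"]})
        # rng.sample raises ValueError if the pool has fewer than
        # num_choices - 1 distinct answers.
        distractors = rng.sample(pool, num_choices - 1)
        # Shuffle the correct choice together with the sampled distractors.
        choices = distractors + [item["answer"]]
        rng.shuffle(choices)
        generalized.append({
            "question": item["question"],
            "choices": choices,
            "label": choices.index(item["answer"]),
        })
    return generalized


# Toy, made-up questions (not from MMLU).
toy = [
    {"question": "2 + 2 = ?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Paris"},
    {"question": "Chemical formula of water?", "answer": "H2O"},
    {"question": "Largest planet in the Solar System?", "answer": "Jupiter"},
    {"question": "Author of Hamlet?", "answer": "Shakespeare"},
]
new_items = generalize(toy)
for entry in new_items:
    print(entry["question"], entry["choices"], "correct index:", entry["label"])
```

Under this sketch, every distractor is an answer the model may have memorized as correct for some other question, which is the situation the abstract describes as the false choices being "not even wrong".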