Overestimation in LLM Evaluation: A Controlled Large-Scale Study on Data Contamination's Impact on Machine Translation

Muhammed Yusuf Kocyigit, Eleftheria Briakou, Daniel Deutsch, Jiaming Luo, Colin Cherry, Markus Freitag

TL;DR

This study conducts a controlled, large-scale analysis of data contamination during pre-training for multilingual machine translation at $1$B and $8$B parameter scales. By decontaminating test sets and exhaustively injecting contamination across modes, timing, and frequency using a checkpoint-branching approach, the authors quantify how contamination inflates BLEU scores, with up to $30$ BLEU points of inflation for $8$B models when both source and target are contaminated. They show that contamination effects intensify with model size, depend on the format and distribution of contaminated data, and are not uniformly transferable to non-contaminated test sets, particularly for zero-resource languages. The work highlights significant implications for evaluation practices and calls for stricter decontamination and evaluation protocols in large-scale LLM deployment and benchmarking.

Abstract

Data contamination -- the accidental consumption of evaluation examples within the pre-training data -- can undermine the validity of evaluation benchmarks. In this paper, we present a rigorous analysis of the effects of contamination on language models at 1B and 8B scales on the machine translation task. Starting from a carefully decontaminated train-test split, we systematically introduce contamination at various stages, scales, and data formats to isolate its effect and measure its impact on performance metrics. Our experiments reveal that contamination with both source and target substantially inflates BLEU scores, and this inflation is 2.5 times larger (up to 30 BLEU points) for 8B compared to 1B models. In contrast, source-only and target-only contamination generally produce smaller, less consistent over-estimations. Finally, we study how the temporal distribution and frequency of contaminated samples influence performance over-estimation across languages with varying degrees of data resources.
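The injection step described above (test examples inserted into pre-training data in different formats and frequencies) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function names `make_contaminated_samples` and `inject_uniform`, the newline-joined rendering of a source-target pair, and the uniform random placement are all hypothetical choices made for the example.

```python
import random
from typing import List, Sequence, Tuple


def make_contaminated_samples(
    test_pairs: Sequence[Tuple[str, str]], mode: str, copies: int
) -> List[str]:
    """Render test pairs as text samples for one contamination mode.

    mode: "full" (source and target together), "source", or "target".
    The newline-joined format for "full" is an assumption for illustration.
    """
    samples: List[str] = []
    for src, tgt in test_pairs:
        if mode == "full":
            text = f"{src}\n{tgt}"
        elif mode == "source":
            text = src
        elif mode == "target":
            text = tgt
        else:
            raise ValueError(f"unknown contamination mode: {mode!r}")
        # Repeat each sample to control contamination frequency.
        samples.extend([text] * copies)
    return samples


def inject_uniform(
    stream: Sequence[str], contaminated: Sequence[str], seed: int = 0
) -> List[str]:
    """Insert contaminated samples at random positions in the training stream.

    The relative order of the original documents is preserved.
    """
    rng = random.Random(seed)
    mixed = list(stream)
    for sample in contaminated:
        mixed.insert(rng.randrange(len(mixed) + 1), sample)
    return mixed


if __name__ == "__main__":
    pairs = [("Guten Morgen", "Good morning"), ("Danke", "Thank you")]
    clean_stream = [f"doc{i}" for i in range(10)]
    full = make_contaminated_samples(pairs, mode="full", copies=3)
    print(len(inject_uniform(clean_stream, full)))
```

Restricting the insertion positions to an early or late slice of the stream, instead of the whole stream, would give a rough analogue of the timing conditions the abstract mentions.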

Paper Structure

This paper contains 28 sections, 21 figures, 18 tables.

Figures (21)

  • Figure 1: Large-scale contamination analysis setup: We decontaminate our train-test splits and train a baseline model. We then insert test data into the pre-training data and train a contaminated model that branches out from the baseline checkpoint. Finally, we compare the relative performance of the contaminated and the baseline model on contaminated and non-contaminated data.
  • Figure 2: Box plot of BLEU differences of contaminated vs. uncontaminated models across WMT'23 language pairs, for $1$B (left) and $8$B (right) model sizes. Contaminating paired source-target instances (full) consistently inflates translation performance across languages, with larger effects on the $8$B model. Source-only and target-only contamination do not inflate performance consistently.
  • Figure 3: Box plot of BLEU improvement differences of contaminated vs. uncontaminated models between WMT'23 and WMT'24. Contaminating source-target examples yields higher performance "improvements" on contaminated vs. non-contaminated datasets.
  • Figure 4: BLEU score throughout training for German to English in WMT'23 for the $8$B model, with Full contamination and 100 copies. Earlier contamination causes larger performance peaks, while later contamination causes lower spikes but higher eventual performance gaps. Uniform contamination tends to yield the highest final performance gains and no sharp peaks.
  • Figure 5: Percent improvement for different contamination methods as the number of copies increases, for the $8$B model. The dotted lines are the percentage improvements per language pair in WMT'23. The solid lines are the mean improvement per method. Performance inflation increases with more copies of Full contamination, while additional copies of source- or target-only contamination do not significantly alter the overall impact.
  • ...and 16 more figures
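
The quantities named in the captions above (BLEU difference between contaminated and baseline models, percent improvement, and the gap between improvements on a contaminated and a non-contaminated test set) can be written down as a small sketch. The function names and the numbers in the usage section are hypothetical and purely illustrative; they are not results from the paper, and the exact definitions the authors use may differ.

```python
def bleu_delta(contaminated_bleu: float, baseline_bleu: float) -> float:
    """Absolute BLEU difference between contaminated and baseline models."""
    return contaminated_bleu - baseline_bleu


def percent_improvement(contaminated_bleu: float, baseline_bleu: float) -> float:
    """Relative BLEU improvement over the baseline, in percent."""
    if baseline_bleu <= 0:
        raise ValueError("baseline BLEU must be positive")
    return 100.0 * (contaminated_bleu - baseline_bleu) / baseline_bleu


def improvement_gap(delta_on_contaminated: float, delta_on_clean: float) -> float:
    """How much more a model 'improves' on the contaminated test set
    than on a non-contaminated one (assumed definition for illustration)."""
    return delta_on_contaminated - delta_on_clean


if __name__ == "__main__":
    # Made-up scores, not taken from the paper.
    seen_set = bleu_delta(contaminated_bleu=40.0, baseline_bleu=32.0)
    unseen_set = bleu_delta(contaminated_bleu=33.0, baseline_bleu=32.0)
    print(seen_set, unseen_set, improvement_gap(seen_set, unseen_set))
    print(percent_improvement(40.0, 32.0))
```

A large positive gap under these definitions would indicate that the gain is specific to the contaminated test set rather than a general improvement in translation quality.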