Inference-Time Decontamination: Reusing Leaked Benchmarks for Large Language Model Evaluation

Qin Zhu; Qingyuan Cheng; Runyu Peng; Xiaonan Li; Tengxiao Liu; Ru Peng; Xipeng Qiu; Xuanjing Huang

TL;DR

This work tackles the problem of benchmark data contamination in large language model evaluations by proposing Inference-Time Decontamination (ITD), a three-stage framework that detects potentially leaked samples, rewrites them to preserve task difficulty, and re-evaluates with assurance checks. By applying MinKProb-based contamination detection and automated rewrite strategies for math (GSM8K) and knowledge-based (MMLU) tasks, ITD reduces artificial performance inflation without requiring the creation of new benchmarks. Proof-of-concept experiments show substantial reductions in inflated accuracy (GSM8K: 22.9%, MMLU: 19.0%), and real-model tests with Phi-3-mini and Mistral-7b-base demonstrate practical applicability, though detection effectiveness remains imperfect. Overall, ITD offers a scalable path to more truthful LLM evaluation and motivates further development of detection and rewriting techniques to curb benchmark leakage in real-world settings.
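The detect, rewrite, and re-check loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the common Min-K% Prob formulation (the mean log-probability of the k fraction of tokens the model finds least likely, with a higher score suggesting the text was seen in training). The names `min_k_prob`, `is_contaminated`, `decontaminate`, `score_fn`, `rewrite_fn`, `threshold`, and `max_rewrites` are hypothetical, and the values of `k` and `threshold` are placeholders rather than settings from the paper.

```python
from typing import Callable, List, Tuple


def min_k_prob(token_logprobs: List[float], k: float = 0.2) -> float:
    """Mean log-probability of the k fraction of tokens with the lowest log-probability."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    n = max(1, int(len(token_logprobs) * k))  # always keep at least one token
    lowest = sorted(token_logprobs)[:n]       # ascending sort: least likely tokens first
    return sum(lowest) / n


def is_contaminated(token_logprobs: List[float], threshold: float, k: float = 0.2) -> bool:
    """Flag a sample when even its least likely tokens score above the threshold."""
    return min_k_prob(token_logprobs, k) > threshold


def decontaminate(
    sample: str,
    score_fn: Callable[[str], List[float]],  # assumed: returns per-token log-probs from the evaluated model
    rewrite_fn: Callable[[str], str],        # assumed: difficulty-preserving rewriter, e.g. another LLM
    threshold: float,
    max_rewrites: int = 3,
    k: float = 0.2,
) -> Tuple[str, int]:
    """Rewrite a sample until it is no longer flagged, or the rewrite budget runs out.

    Returns the final text and the number of rewrites applied.
    """
    current = sample
    for step in range(max_rewrites + 1):
        # Detection doubles as the re-check after each rewrite.
        if not is_contaminated(score_fn(current), threshold, k):
            return current, step
        if step < max_rewrites:
            current = rewrite_fn(current)
    return current, max_rewrites
```

In practice `score_fn` would wrap the model under evaluation and `rewrite_fn` would wrap a rewriting model. Capping the number of rewrites keeps the loop bounded when a sample keeps being flagged.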

Abstract

The training process of large language models (LLMs) often involves varying degrees of test data contamination. Although current LLMs achieve increasingly strong results on various benchmarks, their performance in practical applications does not always match those results. Benchmark leakage can prevent accurate assessment of an LLM's true performance. However, constructing new benchmarks is costly and labor-intensive, and new benchmarks still carry the risk of leakage. In this paper we therefore ask: can we reuse these leaked benchmarks for LLM evaluation? We propose Inference-Time Decontamination (ITD), which addresses this issue by detecting and rewriting leaked samples without altering their difficulty. ITD can mitigate the performance inflation caused by memorizing leaked benchmarks. Our proof-of-concept experiments demonstrate that ITD reduces inflated accuracy by 22.9% on GSM8K and 19.0% on MMLU. On MMLU, applying ITD lowers the results of Phi3 and Mistral by 6.7% and 3.6%, respectively. We hope that ITD can provide more truthful evaluation results for large language models.

Paper Structure (36 sections, 3 equations, 8 figures, 8 tables)

Introduction
Related Work
Contamination Detection
Decontamination
Method
Problem Formulation
Inference-Time Decontamination
Detection
Rewrite
Assurance
Experiment
Setup
Dataset
Model
ITD-Detecting settings
...and 21 more sections

Figures (8)

Figure 1: Illustration of the function of Inference-Time Decontamination, which aims to discern whether a model passes the test by memorizing contaminated data. The two markers in the figure indicate whether the LLM deliberately memorizes or does not memorize the case.
Figure 2: Overview of inference-time decontamination.
Figure 3: Impact of Different Rewriting Steps. A single rewrite is sufficient to significantly mitigate the model's performance inflation. However, some rewritten data may still be classified as contaminated. Multiple rewrites can further alleviate this issue.
Figure 4: Performance of Contaminated vs. Uncontaminated Data with Different Rewriting Steps for Llama2-contaminated on GSM8K. For contaminated data, the model shows artificially high performance (51.3%). After several rewrites, the data becomes uncontaminated, and performance returns to normal (30.9%).
Figure 5: Hyperparameter search experiment on $\epsilon$ on GSM8K.
...and 3 more figures
