Training on the Benchmark Is Not All You Need

Shiwen Ni, Xiangtao Kong, Chengming Li, Xiping Hu, Ruifeng Xu, Jia Zhu, Min Yang

TL;DR

This paper tackles the problem of benchmark data leakage in LLM pre-training by introducing a gray-box leakage detector that exploits the fact that multiple-choice options can be permuted without changing a question’s meaning. The method evaluates log-probability distributions across all option permutations ($n!$) and employs an outlier-based decision in shuffled scenarios to identify leaked data, without needing access to training data or model weights. Empirical results show strong leakage detection in not-shuffled settings and more modest performance in shuffled settings, validated on two LLMs and 35 open-source models across four benchmarks, complemented by a leakage leaderboard and illustrative case studies. The work highlights substantial risks of benchmark leakage, especially for the Qwen family, and advocates for more robust evaluation standards and extensions to non-MCQ formats in future research.

Abstract

The success of Large Language Models (LLMs) relies heavily on the large volume of data learned during pre-training. The opacity of the pre-training process and of the training data makes the results of many benchmark tests unreliable. If a model has been trained on a benchmark test set, its reported scores no longer reflect its real capability, which seriously harms the health of the field. To test the capabilities of large language models automatically and efficiently, many mainstream benchmarks adopt a multiple-choice format. Because swapping the contents of multiple-choice options does not change the meaning of the question itself, we propose a simple and effective data leakage detection method based on this property. Specifically, we shuffle the contents of the options in the data to generate the corresponding derived data sets, and then detect data leakage based on the model's log-probability distribution over the derived data sets. If the maximum of the set of log probabilities is also an outlier, this indicates that the data was leaked. Our method works under gray-box conditions, without access to model training data or weights, and effectively identifies leakage of benchmark test sets into model pre-training data, both in the normal scenario and in the more complex scenario where options may have been shuffled intentionally or unintentionally. Through experiments based on two LLMs and benchmark designs, we demonstrate the effectiveness of our method. In addition, we evaluate the degree of data leakage of 35 mainstream open-source LLMs on four benchmark datasets and give a ranking of the leaked LLMs for each benchmark; we find that the Qwen family of LLMs has the highest degree of data leakage.
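The decision rule described in the abstract (the largest log probability must also be an outlier) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `max_is_outlier`, the z-score test, and the threshold of 3.0 are all assumptions chosen for the example, and the log probabilities are assumed to have been obtained already by scoring each option ordering with the model under test.

```python
from statistics import mean, stdev


def max_is_outlier(log_probs, z_threshold=3.0):
    """Return (index_of_max, flagged) for one question.

    log_probs holds one sequence log probability per option ordering.
    The z-score rule and its threshold are illustrative assumptions,
    not the outlier criterion used in the paper.
    """
    if len(log_probs) < 3:
        raise ValueError("need at least three orderings to estimate spread")

    # Position of the ordering the model finds most likely.
    best = max(range(len(log_probs)), key=log_probs.__getitem__)

    # Compare the maximum against the remaining orderings only,
    # so that it does not inflate its own reference statistics.
    rest = [lp for i, lp in enumerate(log_probs) if i != best]
    spread = stdev(rest)
    if spread == 0:
        # All other orderings score identically; any strict gap stands out.
        return best, log_probs[best] > rest[0]

    z_score = (log_probs[best] - mean(rest)) / spread
    return best, z_score > z_threshold
```

In the not-shuffled scenario one would additionally check that the flagged index corresponds to the benchmark's original option order; in the shuffled scenario the flag alone is the signal, since any ordering may have been the one seen in training.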

Paper Structure

This paper contains 13 sections, 6 equations, 5 figures, 2 tables, 2 algorithms.

Figures (5)

  • Figure 1: Log-probability distributions for different option orders. For example: {Order1: All of the following are examples of connective tissue EXCEPT A: ligaments B: muscle C: blood D: cartilage ,..., Order24: All of the following are examples of connective tissue EXCEPT A: cartilage B: blood C: muscle D: ligaments}.
  • Figure 2: When the order with the largest log probability is also an outlier, this indicates that the data was seen in that order during pre-training.
  • Figure 3: Benchmark leakage leaderboard in LLMs.
  • Figure 4: Case analysis of Qwen2-7B and LLaMA2-7B on C-Eval under scenario a.
  • Figure 5: Case analysis of Qwen2-7B under scenario b.
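The derived orderings illustrated in Figure 1 can be enumerated with a short sketch like the one below. The helper name `derived_prompts` and the exact prompt layout ("A: ... B: ...") are assumptions modelled on the Figure 1 caption; the paper's actual prompt template may differ.

```python
from itertools import permutations


def derived_prompts(question, options, labels="ABCD"):
    """Build one prompt per permutation of the option contents.

    The labels stay in place (A, B, C, D); only the option texts move,
    so the meaning of the question is unchanged. The prompt layout is
    an assumption based on the Figure 1 caption.
    """
    if len(options) > len(labels):
        raise ValueError("not enough labels for the given options")

    prompts = []
    for order in permutations(options):
        body = " ".join(f"{label}: {text}" for label, text in zip(labels, order))
        prompts.append(f"{question} {body}")
    return prompts


question = "All of the following are examples of connective tissue EXCEPT"
options = ["ligaments", "muscle", "blood", "cartilage"]
prompts = derived_prompts(question, options)
print(len(prompts))  # 4! orderings for a four-option question
print(prompts[0])
print(prompts[-1])
```

Each prompt would then be scored by the model under test to obtain the log-probability distribution that the detector inspects.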