On The Fragility of Benchmark Contamination Detection in Reasoning Models

Han Wang, Haoyu Li, Brian Ko, Huan Zhang

TL;DR

The work investigates benchmark contamination in large reasoning models (LRMs) and reveals two critical failure modes. First, contamination introduced during SFT can be concealed by subsequent RL fine-tuning (Stage I), owing to PPO-style clipping and importance sampling. Second, extensive CoT contamination at the final training stage (Stage II) leaves almost no detectable traces for memorization-based detectors. The authors provide both a theoretical account (the log-likelihood gap $G_k$ contracts and the per-prompt drift $\Delta_x$ shrinks) and extensive empirical evidence (AUROC declines, converging log-prob distributions, near-random detection under CoT) across multiple datasets. These findings challenge the reliability of current contamination-detection approaches and undermine the fairness of public leaderboards, underscoring the need for contamination-robust evaluation protocols and detection methods that account for long CoT reasoning and distributional generalization. The paper suggests practical directions, including releasing more intermediate checkpoints and moving beyond memorization-based signals, to ensure trustworthy evaluation of LRMs.

Abstract

Leaderboards for LRMs have turned evaluation into a competition, incentivizing developers to optimize directly on benchmark suites. A shortcut to achieving higher rankings is to incorporate evaluation benchmarks into the training data, thereby yielding inflated performance; this is known as benchmark contamination. Surprisingly, our studies find that evading contamination detection for LRMs is alarmingly easy. We focus on the two scenarios where contamination may occur in practice: (I) when the base model evolves into an LRM via SFT and RL, we find that contamination introduced during SFT is initially identifiable by contamination detection methods. Yet even brief GRPO training can markedly conceal the contamination signals that most detection methods rely on. Further empirical experiments and theoretical analysis indicate that PPO-style importance sampling and clipping objectives are the root cause of this concealment, indicating that a broad class of RL methods may inherently exhibit similar concealment capability; (II) when SFT contamination with CoT is applied to advanced LRMs as the final stage, most contamination detection methods perform near random guessing. Even without exposure to non-members, contaminated LRMs are still more confident when responding to unseen samples that share a similar distribution with the training set, and thus evade existing memorization-based detection methods. Together, our findings reveal the unique vulnerability of LRM evaluation: model developers could easily contaminate LRMs to achieve inflated leaderboard performance while leaving minimal traces of contamination, thereby strongly undermining the fairness of evaluation and threatening the integrity of public leaderboards. This underscores the urgent need for advanced contamination detection methods and trustworthy evaluation protocols tailored to LRMs.
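To make the "PPO-style importance sampling and clipping" mentioned in the abstract concrete, the following is a minimal sketch of the standard per-token clipped surrogate from PPO. It is not taken from the paper: the function name `clipped_surrogate`, the default `eps=0.2`, and the scalar per-token interface are assumptions chosen for illustration, and the paper's actual GRPO loss may differ in its details.

```python
import math


def clipped_surrogate(logp_new: float, logp_old: float,
                      advantage: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate for one token (illustrative sketch).

    logp_new / logp_old are the token log-probs under the current and the
    sampling policy; their difference gives the importance-sampling ratio.
    """
    # Importance-sampling ratio pi_new(token) / pi_old(token).
    ratio = math.exp(logp_new - logp_old)
    # Restrict the ratio to the trust region [1 - eps, 1 + eps].
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    # Pessimistic bound: take the smaller of the raw and clipped terms.
    return min(ratio * advantage, clipped * advantage)


if __name__ == "__main__":
    # With a positive advantage, a ratio above 1 + eps is capped, so the
    # objective stops rewarding further probability increases on that token.
    print(clipped_surrogate(math.log(2.0), 0.0, advantage=1.0))
    # With a negative advantage, a ratio below 1 - eps is capped likewise.
    print(clipped_surrogate(math.log(0.5), 0.0, advantage=-1.0))
```

Once the ratio leaves the clipping interval in the direction favored by the advantage, the objective becomes flat in the new policy's log-prob, so that token no longer contributes a gradient. The sketch only illustrates this mechanism; how it leads to concealment is the subject of the paper's analysis.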

Paper Structure

This paper contains 57 sections, 2 theorems, 43 equations, 12 figures, 15 tables.

Key Result

Theorem 3.1

For a small natural gradient step with step size $\eta$ on a PPO style loss, we have

Figures (12)

  • Figure 1: Two scenarios where contamination may happen to LRMs. In Stage I (pre-LRM), while SFT contamination to the base model is initially detectable, contamination evidence can be concealed through subsequent RL training. In Stage II (post-LRM), extensive contamination with CoT on advanced LRMs barely leaves evidence for existing memorization-based detection methods.
  • Figure 2: AUROC (%) trends for an SFT-contaminated model further trained with different objectives. While contamination introduced through SFT is initially detectable by existing methods, subsequent RL training with clean samples (e.g., GRPO or RAFT++) consistently degrades detection performance. Moreover, we observe a monotonic decline in detection performance as the number of RL steps increases, and reference-free methods (e.g., Loss, Min-K, and Max-K) already fall to near-random guessing (i.e., AUROC $\approx$ 50%) after only 156 steps.
  • Figure 3: Log-prob distributions for members vs. non-members of the SFT-contaminated model before and after RL training. After further GRPO with clean samples on the SFT-contaminated model, the log-prob distributions of members and non-members become increasingly similar. Since many contamination detection methods rely on separability in this space, the shrinking gap explains their degraded effectiveness. More log-prob distributions can be found in Fig. \ref{fig:log_prob_all_gpqa}, \ref{fig:log_prob_all_olympiadbench}, and \ref{fig:log_prob_all_minerva_math}.
  • Figure 4: Log-prob distributions for members vs. non-members of advanced LRMs before and after SFT contamination. After extensive SFT contamination on members, the log-probs of both members and non-members increase by a similar margin. More figures are in Fig. \ref{fig:log_prob_r1_minerva_math} and \ref{fig:log_prob_r1_gpqa}.
  • Figure 5: Log-prob distributions for members vs. non-members of the SFT-contaminated model before and after RL training on GPQA-Diamond. With additional GRPO or RAFT++ training on clean samples, the member and non-member log-probability distributions become increasingly similar. Since many contamination detection methods rely on separability in this space, the shrinking gap explains their degraded effectiveness. In contrast, further RAFT training does not induce the earlier distribution collapse; as we explain in Sec. \ref{raft_no_conceal}, the absence of a clipping term prevents it. Likewise, additional SFT does not collapse the membership distributions.
  • ...and 7 more figures
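
The captions above report detection quality as AUROC over per-sample scores derived from token log-probs. As a rough illustration of how such a number could be computed, here is a minimal sketch assuming a Min-K-style score (the mean of the lowest fraction `k` of token log-probs) and a pairwise AUROC. The function names, the default `k=0.2`, and the toy inputs are assumptions for illustration, not the paper's evaluation code.

```python
from typing import Sequence


def min_k_score(token_log_probs: Sequence[float], k: float = 0.2) -> float:
    """Mean of the lowest k-fraction of token log-probs (Min-K-style score)."""
    n = max(1, int(len(token_log_probs) * k))
    lowest = sorted(token_log_probs)[:n]
    return sum(lowest) / n


def auroc(member_scores: Sequence[float],
          non_member_scores: Sequence[float]) -> float:
    """Probability that a random member outscores a random non-member.

    Ties count as one half, so 0.5 corresponds to random guessing.
    """
    wins = 0.0
    for m in member_scores:
        for u in non_member_scores:
            if m > u:
                wins += 1.0
            elif m == u:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


if __name__ == "__main__":
    # Hypothetical token log-probs: members are slightly more likely here.
    members = [[-0.1, -0.3, -0.2, -0.4], [-0.2, -0.5, -0.1, -0.3]]
    non_members = [[-0.9, -1.2, -0.4, -0.8], [-0.7, -1.5, -0.6, -1.0]]
    m_scores = [min_k_score(x, k=0.5) for x in members]
    u_scores = [min_k_score(x, k=0.5) for x in non_members]
    print(auroc(m_scores, u_scores))
```

When the member and non-member score distributions overlap almost completely, as the log-prob figures describe after RL training, a pairwise AUROC of this kind drifts toward 0.5.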

Theorems & Definitions (3)

  • Theorem 3.1
  • Theorem
  • proof