Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models

Yongding Tao, Tian Wang, Yihong Dong, Huanyu Liu, Kechi Zhang, Xiaolong Hu, Ge Li

TL;DR

This work addresses data contamination in the RL post-training phase of large language models, a regime where traditional likelihood-based detectors fail because of reward-driven optimization. It introduces Self-Critique, an entropy-based detector that probes the model’s reasoning paths by comparing token-level entropy between the initial generation and a self-critique generation, thereby exposing the RL-induced policy collapse associated with memorized, contaminated samples. To enable rigorous evaluation, the authors construct RL-MIA, a benchmark that simulates RL-specific contamination across math and logic tasks, and demonstrate that Self-Critique substantially outperforms baselines, with an AUC improvement of up to about 30% and robust performance across multiple RL algorithms. The work highlights the practical importance of RL-aware contamination detection for trustworthy evaluation of RL-enhanced LLM reasoning and provides a reproducible framework with open-source resources.

Abstract

Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly important phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of data contamination detection within the RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after the RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
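
The entropy comparison at the core of Self-Critique can be illustrated with a small sketch. This is not the authors' implementation: it assumes the per-step next-token probability distributions for both generations are already available, and it uses cosine similarity over truncated entropy sequences as a stand-in similarity measure, since the summary does not state which measure the paper uses. The function names (`token_entropy`, `entropy_sequence`, `contamination_score`) and the toy distributions are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_sequence(step_distributions):
    """Map a generation (one distribution per generated token) to its entropy sequence."""
    return [token_entropy(p) for p in step_distributions]


def cosine_similarity(a, b):
    """Cosine similarity of two sequences, truncated to their common length.

    Truncation and cosine similarity are assumptions of this sketch,
    not details taken from the paper.
    """
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Arbitrary convention for this sketch: undefined similarity maps to 0.
        return 0.0
    return dot / (norm_a * norm_b)


def contamination_score(initial_dists, critique_dists):
    """Higher score = entropy profiles of the two generations are more alike."""
    return cosine_similarity(
        entropy_sequence(initial_dists), entropy_sequence(critique_dists)
    )


# Toy distributions over a two-token vocabulary (made up for illustration).
initial = [[0.9, 0.1], [0.5, 0.5], [0.99, 0.01]]
critique_same_path = [[0.9, 0.1], [0.5, 0.5], [0.99, 0.01]]
critique_new_path = [[0.5, 0.5], [0.99, 0.01], [0.5, 0.5]]

score_collapsed = contamination_score(initial, critique_same_path)
score_divergent = contamination_score(initial, critique_new_path)
print(score_collapsed > score_divergent)
```

In this toy setup, the critique that retraces the same entropy profile scores higher than the one that takes a different path, matching the intuition that high similarity in entropy space points to policy collapse on a contaminated sample. Any decision threshold on the score would have to be chosen empirically.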

Paper Structure

This paper contains 35 sections, 9 equations, 6 figures, 8 tables, 1 algorithm.

Figures (6)

  • Figure 1: Motivation behind Self-Critique. After RL post-training, entropy distributions become sparse. (a) For contaminated samples, the critique reasoning path remains highly similar to the original one, indicating policy collapse and memorization. (b) Clean samples exhibit greater divergence between the original and critique reasoning paths. (c) Our method achieves a significantly higher AUC while existing baselines perform close to random guess.
  • Figure 2: Overview of the Self-Critique detection workflow. The method compares token-level entropy sequences between the initial response and the self-critique response. High similarity in entropy space indicates contamination (policy collapse), while low similarity indicates clean samples.
  • Figure 3: Dual-stage contamination analysis. Self-Critique on the lower-pretraining-contamination subset (green) improves sharply as the rate decreases.
  • Figure 4: Self-critique probing vs. no self-critique.
  • Figure 5: Ablation on Sampling Strategy
  • ...and 1 more figure
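
Figure 1(c) and the abstract report results as AUC and describe the baselines as close to a random guess. As a reminder of what that metric measures, here is a minimal sketch of AUC computed as the probability that a randomly chosen contaminated sample receives a higher detector score than a randomly chosen clean one (ties count as half). The function name `auc` and the example scores are hypothetical and are not results from the paper.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: detector outputs, higher means "more likely contaminated".
    labels: 1 for contaminated samples, 0 for clean samples.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


labels = [1, 1, 0, 0]
perfect = auc([0.9, 0.8, 0.2, 0.1], labels)       # every contaminated sample outranks every clean one
uninformative = auc([0.5, 0.5, 0.5, 0.5], labels)  # detector cannot separate the groups
print(perfect, uninformative)
```

A detector whose scores carry no information about contamination lands at 0.5, which is the "random guess" level the baselines are said to approach.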