Detecting Distillation Data from Reasoning Models

Hengxiang Zhang, Hyeong Kyu Choi, Sharon Li, Hongxin Wei

TL;DR

This work tackles benchmark contamination from reasoning distillation by introducing a distillation data detection task that operates on question inputs alone. It proposes Token Probability Deviation (TBD), which uses the probabilities of tokens generated by a distilled model to distinguish seen (member) questions from unseen (non-member) ones by measuring how far those probabilities fall from a high reference probability. In experiments on S1, S1.1, and LIMO with models such as Qwen2.5-32B-Instruct, TBD achieves strong detection performance (an AUC of 0.918 and a TPR@1%FPR of 0.470 on S1) and remains robust across model sizes, data scales, and hyperparameters. Because the method relies only on generated tokens, it stays effective when distillation data is only partially available, and it offers a practical tool for transparency and fairness in distillation-based reasoning systems.

Abstract

Reasoning distillation has emerged as an efficient and powerful paradigm for enhancing the reasoning capabilities of large language models. However, reasoning distillation may inadvertently cause benchmark contamination, where evaluation data included in distillation datasets can inflate performance metrics of distilled models. In this work, we formally define the task of distillation data detection, which is uniquely challenging due to the partial availability of distillation data. Then, we propose a novel and effective method Token Probability Deviation (TBD), which leverages the probability patterns of the generated output tokens. Our method is motivated by the analysis that distilled models tend to generate near-deterministic tokens for seen questions, while producing more low-probability tokens for unseen questions. Our key idea behind TBD is to quantify how far the generated tokens' probabilities deviate from a high reference probability. In effect, our method achieves competitive detection performance by producing lower scores for seen questions than for unseen questions. Extensive experiments demonstrate the effectiveness of our method, achieving an AUC of 0.918 and a TPR@1% FPR of 0.470 on the S1 dataset.
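
The scoring idea described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state the exact formula, so the clipped deviation max(0, tau - p), the exponent alpha, the averaging over tokens, and the truncation to the first max_tokens tokens are all assumptions, loosely named after the hyperparameters τ, α, and M mentioned in the figure captions. The function name tbd_score and the default values are hypothetical.

```python
from typing import Optional, Sequence


def tbd_score(
    token_probs: Sequence[float],
    tau: float = 0.9,          # assumed reference probability (placeholder value)
    alpha: float = 1.0,        # assumed exponent on each deviation; should be > 0
    max_tokens: Optional[int] = None,  # assumed truncation length (M in the captions)
) -> float:
    """Average deviation of generated-token probabilities below a reference.

    token_probs holds the probability the model assigned to each token it
    generated (for example under greedy decoding). Lower scores suggest a
    seen (member) question, higher scores an unseen (non-member) one.
    """
    probs = list(token_probs if max_tokens is None else token_probs[:max_tokens])
    if not probs:
        raise ValueError("token_probs must contain at least one probability")
    # Only tokens whose probability falls below tau contribute to the score.
    deviations = [max(0.0, tau - p) ** alpha for p in probs]
    return sum(deviations) / len(probs)


# Illustrative, made-up probabilities (not data from the paper).
member_like = [0.99, 0.98, 1.0, 0.97]      # near-deterministic generation
non_member_like = [0.99, 0.4, 0.6, 0.95]   # some low-probability tokens
member_score = tbd_score(member_like, tau=0.9)
non_member_score = tbd_score(non_member_like, tau=0.9)
```

In this toy example the near-deterministic sequence scores lower than the sequence containing low-probability tokens, which matches the direction stated in the abstract (lower scores for seen questions).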

Paper Structure

This paper contains 39 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Overview of distillation data detection. The top panel illustrates the reasoning distillation pipeline, which distils the reasoning capabilities of LRMs into smaller LLMs. The bottom panel illustrates the process of detecting distillation data.
  • Figure 2: Score distributions of Min-K% for members and non-members, obtained from the distilled model trained on the LIMO dataset using the Qwen2.5-32B-Instruct base model.
  • Figure 3: Comparison of token-level generation behaviour of distilled models for 20 member and 20 non-member questions under greedy decoding. (a) Token-wise probability distributions: we contrast the distribution of token-wise probability between members and non-members, showing that non-members tend to produce more tokens with lower probability. (b) Near-deterministic vs. non-deterministic tokens: near-deterministic tokens denote generated tokens with probabilities approaching 1, and vice versa for non-deterministic tokens. The distilled reasoning model tends to generate more near-deterministic tokens for members.
  • Figure 4: Effect of distillation data size and parameter $\alpha$ on our method's performance.
  • Figure 5: Ablation study on hyperparameter $M$ and threshold $\tau$. We present AUC and TPR@$1\%$FPR of our method with varying truncation length $M$ on three datasets, and AUC of our method under varying threshold $\tau$ on the S1 dataset.
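
The two metrics reported in the captions above, AUC and TPR@1%FPR, can be computed from per-question detection scores roughly as follows. This is a small pure-Python sketch rather than the paper's evaluation code. It assumes, as the abstract states, that lower scores indicate members; the function names and the conservative tie handling at the threshold are illustrative choices.

```python
import math
from typing import Sequence


def auc_lower_is_member(member_scores: Sequence[float],
                        non_member_scores: Sequence[float]) -> float:
    """Probability that a random member scores lower than a random non-member.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    wins = 0.0
    for m in member_scores:
        for n in non_member_scores:
            if m < n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


def tpr_at_fpr(member_scores: Sequence[float],
               non_member_scores: Sequence[float],
               target_fpr: float = 0.01) -> float:
    """True positive rate when at most target_fpr of non-members are flagged.

    A question is flagged as a member when its score is strictly below the
    threshold, so tied non-members never push the FPR above the target.
    """
    ranked = sorted(non_member_scores)
    allowed_false_positives = int(target_fpr * len(ranked))
    if allowed_false_positives >= len(ranked):
        threshold = math.inf
    else:
        threshold = ranked[allowed_false_positives]
    flagged = sum(1 for m in member_scores if m < threshold)
    return flagged / len(member_scores)


# Illustrative, made-up scores (not results from the paper).
members = [0.0, 0.01, 0.3]
non_members = [0.2, 0.25, 0.4, 0.5]
auc = auc_lower_is_member(members, non_members)
tpr = tpr_at_fpr(members, non_members, target_fpr=0.01)
```

With only a handful of non-members, a 1% FPR budget allows zero false positives, so the threshold sits at the lowest non-member score. Real evaluations need enough non-member questions for the 1% operating point to be meaningful.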