Detecting Distillation Data from Reasoning Models
Hengxiang Zhang, Hyeong Kyu Choi, Sharon Li, Hongxin Wei
TL;DR
This work tackles benchmark contamination caused by reasoning distillation by introducing a distillation data detection task in which only the question inputs are available. It proposes Token Probability Deviation (TBD), which uses the probabilities of tokens generated by a distilled model to distinguish seen (member) questions from unseen (non-member) ones, measuring how far those probabilities fall from a high reference probability. In experiments on S1, S1.1, and LIMO with models such as Qwen2.5-32B-Instruct, TBD achieves strong detection performance (e.g., an AUC of 0.918 and a TPR@1% FPR of 0.470 on S1) and remains robust across model sizes, data scales, and hyperparameters. Because the method relies only on generated tokens, it stays effective under partial data availability and offers a practical tool for transparency and fairness in distillation-based reasoning systems.
Abstract
Reasoning distillation has emerged as an efficient and powerful paradigm for enhancing the reasoning capabilities of large language models. However, reasoning distillation may inadvertently cause benchmark contamination, where evaluation data included in distillation datasets can inflate the performance metrics of distilled models. In this work, we formally define the task of distillation data detection, which is uniquely challenging due to the partial availability of distillation data. We then propose a novel and effective method, Token Probability Deviation (TBD), which leverages the probability patterns of the generated output tokens. Our method is motivated by the observation that distilled models tend to generate near-deterministic tokens for seen questions, while producing more low-probability tokens for unseen questions. The key idea behind TBD is to quantify how far the generated tokens' probabilities deviate from a high reference probability. As a result, our method produces lower scores for seen questions than for unseen questions, which yields competitive detection performance. Extensive experiments demonstrate the effectiveness of our method, which achieves an AUC of 0.918 and a TPR@1% FPR of 0.470 on the S1 dataset.
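To make the scoring idea concrete, the following is a minimal sketch of a deviation-style score, not the paper's exact formulation. The abstract does not give the formula, so the function names (`deviation_score`, `predict_member`), the reference probability of 0.9, the exponent, the clipping of tokens that already exceed the reference, and the averaging over tokens are all assumptions made for illustration. The input is assumed to be the list of probabilities the distilled model assigned to each token it generated for a question.

```python
from typing import Sequence


def deviation_score(
    token_probs: Sequence[float],
    reference: float = 0.9,   # assumed "high reference probability"
    exponent: float = 1.0,    # assumed shaping hyperparameter
) -> float:
    """Average shortfall of generated-token probabilities below a reference.

    Tokens at or above the reference contribute zero (an assumption), so
    near-deterministic generations score close to 0.
    """
    if not token_probs:
        raise ValueError("token_probs must be non-empty")
    total = 0.0
    for p in token_probs:
        gap = max(0.0, reference - p)  # only count probabilities below the reference
        total += gap ** exponent
    return total / len(token_probs)


def predict_member(score: float, threshold: float) -> bool:
    """Flag a question as seen (member) when its score falls below a threshold."""
    return score < threshold


# Illustrative, made-up probabilities: a near-deterministic generation
# versus one that contains several low-probability tokens.
seen_like = [0.99, 0.97, 1.0, 0.95]
unseen_like = [0.99, 0.4, 0.85, 0.2]

seen_score = deviation_score(seen_like)
unseen_score = deviation_score(unseen_like)
print(seen_score, unseen_score)
```

In this sketch a lower score indicates a more likely member, matching the direction described in the abstract. The decision threshold passed to `predict_member` is hypothetical and would in practice be chosen on held-out data, for example to meet a target false positive rate.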
