Antidistillation Sampling
Yash Savani, Asher Trockman, Zhili Feng, Yixuan Even Xu, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter
TL;DR
Antidistillation sampling addresses the vulnerability of frontier LLMs to distillation by perturbing the teacher's decoding process at sampling time, so that the teacher keeps its nominal utility while a student trained on its outputs performs worse. The method adds a gradient-informed penalty, computed with a proxy model and approximated efficiently by a finite-difference scheme, which makes it possible to poison reasoning traces in real time without a large utility loss for the teacher. Empirical results on GSM8K, MATH, and MMLU show a controllable trade-off: with a carefully chosen λ, the teacher maintains high accuracy while the distilled student degrades substantially relative to naive temperature-based decoding. The approach generalizes across model families and proxy-student configurations, suggesting it is practical for protecting proprietary reasoning capabilities and intellectual property in frontier-model deployments.
Abstract
Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model's next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model's practical utility. For further details, see https://antidistillation.com.
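To make the idea of modifying the next-token distribution concrete, the following is a minimal sketch and not the authors' implementation. It assumes the adjusted distribution is a softmax over the teacher's temperature-scaled logits plus λ times a per-token penalty, and that the penalty is a central finite difference between proxy-model log-probabilities evaluated at two perturbed parameter settings. The function names, the sign convention, and the step size eps are hypothetical choices made for illustration only.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D array of scores."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def finite_difference_penalty(logp_plus, logp_minus, eps):
    """Central-difference estimate of a directional derivative.

    logp_plus / logp_minus are assumed to be proxy-model next-token
    log-probabilities computed with parameters nudged by +eps and -eps
    along some gradient direction (hypothetical inputs for this sketch).
    """
    return (logp_plus - logp_minus) / (2.0 * eps)


def adjusted_distribution(teacher_logits, penalty, lam, temperature=1.0):
    """Teacher distribution shifted by lam * penalty (assumed form)."""
    return softmax(teacher_logits / temperature + lam * penalty)


def sample_token(teacher_logits, penalty, lam, temperature=1.0, rng=None):
    """Draw one token index from the adjusted distribution."""
    rng = np.random.default_rng() if rng is None else rng
    probs = adjusted_distribution(teacher_logits, penalty, lam, temperature)
    return int(rng.choice(len(probs), p=probs))
```

In this sketch, setting lam to 0 recovers ordinary temperature sampling from the teacher, and larger values of lam trade teacher fidelity for a stronger perturbation, which mirrors the controllable trade-off described in the TL;DR.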
