Antidistillation Sampling

Yash Savani, Asher Trockman, Zhili Feng, Yixuan Even Xu, Avi Schwarzschild, Alexander Robey, Marc Finzi, J. Zico Kolter

TL;DR

Antidistillation sampling addresses the vulnerability of frontier LLMs to distillation by perturbing the teacher's decoding process so that the teacher keeps its nominal utility while a student distilled from its outputs performs worse. It introduces a proxy-model-based, gradient-informed penalty that is approximated efficiently with a finite-difference scheme, so reasoning traces can be poisoned at sampling time without a large loss in teacher utility. Empirical results on GSM8K, MATH, and MMLU show a controllable trade-off: with a suitably chosen λ, the teacher maintains high accuracy while the distilled student degrades substantially compared with naive temperature-based decoding. The approach generalizes across model families and proxy-student configurations, suggesting practical applicability for protecting proprietary reasoning capabilities and intellectual property in frontier-model deployments.

Abstract

Frontier models that generate extended reasoning traces inadvertently produce rich token sequences that can facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling provides exactly this capability. By strategically modifying a model's next-token probability distribution, antidistillation sampling poisons reasoning traces, rendering them significantly less effective for distillation while preserving the model's practical utility. For further details, see https://antidistillation.com.

Paper Structure

This paper contains 19 sections, 14 equations, 11 figures, 1 algorithm.

Figures (11)

  • Figure 1: Reasoning traces generated via antidistillation sampling poison distillation attempts while simultaneously preserving the teacher's performance. The teacher's logits are perturbed in a direction $\Delta$, leading to samples that significantly degrade distilled model performance relative to naive temperature sampling. For more details, see \ref{fig:delta-details} and §\ref{sec:method}.
  • Figure 2: An illustration of approximating $\Delta$. The teacher model performs antidistillation sampling autoregressively, drawing each token from its distribution perturbed by $\Delta$. Given an input prompt and $t$ reasoning tokens from the teacher, $\Delta$ is approximated by the difference in the log probability of each vocabulary token between two copies of the proxy model, where the second copy is obtained from the first by a single gradient ascent step on the downstream task loss; this difference is represented by the area in the bar plot.
  • Figure 3: Antidistillation sampling uses a tunable parameter $\lambda$ to control the trade-off between teacher accuracy and distillability. The baseline samples from the teacher with increasing temperature $\tau$, showing that traces that are bad for distillation can be produced at some cost in teacher accuracy. Notably, on the blue temperature-sampling curve, bringing the student's accuracy below its undistilled accuracy requires the teacher's accuracy to drop to 20%. With antidistillation sampling, by contrast, the teacher still reaches 70% accuracy while producing traces that bring the student's accuracy below its undistilled accuracy.
  • Figure 4: For both MMLU and MATH data, we show that antidistillation sampling can bring student accuracies down with relatively little cost to the teacher.
  • Figure 5: Distillation loss curves show that although the student's training loss decreases across steps, antidistillation sampling effectively poisons the traces, as shown by the student's increasing holdout loss.
  • ...and 6 more figures
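
The approximation of $\Delta$ described in the Figure 2 caption can be sketched on a toy problem. The block below is a minimal illustration under stated assumptions, not the authors' implementation: the proxy is a hypothetical linear-softmax model with weights `W`, the downstream task is a single labelled example, and the function names (`downstream_loss_grad`, `approx_delta`, `antidistill_probs`), the step size `eps`, and the exact way $\lambda$ and $\tau$ enter the perturbed distribution are all chosen here for illustration.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over a 1-D logit vector."""
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def downstream_loss_grad(W, h_down, y_down):
    """Gradient of the cross-entropy loss -log p(y_down | h_down) w.r.t. W
    for the toy linear-softmax proxy (assumed model, not from the paper)."""
    p = np.exp(log_softmax(W @ h_down))
    p[y_down] -= 1.0                      # softmax(logits) - one_hot(y_down)
    return np.outer(p, h_down)

def approx_delta(W, h_ctx, grad, eps=1e-4):
    """Finite-difference estimate of Delta: per-token log-probability
    difference between the proxy after one gradient *ascent* step on the
    downstream loss and the original proxy, divided by the step size."""
    base = log_softmax(W @ h_ctx)
    stepped = log_softmax((W + eps * grad) @ h_ctx)
    return (stepped - base) / eps

def antidistill_probs(teacher_logits, delta, lam, tau=1.0):
    """Assumed form of the perturbed teacher distribution:
    softmax(teacher_logits / tau + lam * delta)."""
    return np.exp(log_softmax(teacher_logits / tau + lam * delta))

rng = np.random.default_rng(0)
V, D = 8, 4                               # toy vocabulary and feature sizes
W = rng.normal(size=(V, D))               # proxy weights
h_ctx = rng.normal(size=D)                # features of prompt + t tokens
h_down, y_down = rng.normal(size=D), 3    # one downstream example
teacher_logits = rng.normal(size=V)       # stand-in for the teacher's logits

grad = downstream_loss_grad(W, h_down, y_down)
delta = approx_delta(W, h_ctx, grad)
probs = antidistill_probs(teacher_logits, delta, lam=0.5, tau=0.8)
next_token = rng.choice(V, p=probs)       # one antidistillation sampling step
```

In this sketch, setting `lam=0.0` recovers plain temperature sampling from the teacher, while larger values of `lam` shift probability toward tokens whose likelihood rises under the proxy copy that has a higher downstream loss. The estimate costs two proxy forward passes per generated token and no per-token backward pass, since the downstream gradient is computed once.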