DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation

Bo Jiang

TL;DR

The paper shows that the effectiveness of distillation defenses is highly task-dependent and that current output-level approaches are insufficient to broadly prevent knowledge theft.

Abstract

Knowledge distillation from proprietary LLM APIs poses a growing threat to model providers, yet defenses against this attack remain fragmented and unevaluated. We present DistillGuard, a framework for systematically evaluating output-level defenses against LLM knowledge distillation. We introduce a taxonomy of three defense categories (output perturbation, data poisoning, and information throttling) and evaluate nine defense configurations using a standardized pipeline with Qwen3-14B as teacher and Qwen2.5-7B-Instruct as student across three benchmarks (MATH-500, HumanEval+, MT-Bench). Our results reveal that, in a same-family distillation setting against a naive attacker, most output-level defenses are surprisingly ineffective: paraphrasing-based perturbation barely degrades distilled student quality, and data poisoning primarily impairs conversational fluency while leaving task-specific capabilities intact. Only chain-of-thought removal substantially impairs mathematical reasoning (31.4% vs. 67.8% baseline), though code generation remains unaffected. These findings demonstrate that the effectiveness of distillation defenses is highly task-dependent and that current output-level approaches are insufficient to broadly prevent knowledge theft.

Paper Structure

This paper contains 69 sections, 4 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Taxonomy of output-level distillation defenses evaluated in this work. Each category targets a different mechanism: perturbation corrupts the signal, poisoning introduces adversarial errors, and throttling restricts information content.
  • Figure 2: Perturbation strength vs. student quality. Dashed lines show the baseline. No consistent degradation occurs at any perturbation level; $\alpha = 1.0$ slightly improves MATH performance.
  • Figure 3: Poisoning rate vs. student quality. MT-Bench (right axis) degrades monotonically while MATH and HumanEval+ remain stable. Dashed lines show baselines.
  • Figure 4: Average DE by defense category and benchmark. Each category primarily affects a different capability: poisoning targets conversation, throttling targets reasoning, perturbation affects none.
  • Figure 5: DE--DC trade-off for all nine defenses. The ideal defense (low DE, low DC) would occupy the bottom-left corner. Instead, all defenses either cluster near the top (high DE $\approx$ no protection) with low DC, or achieve lower DE only at substantial DC cost (A08). No defense reaches the ideal region.