DistillGuard: Evaluating Defenses Against LLM Knowledge Distillation

Bo Jiang

TL;DR

The effectiveness of distillation defenses is highly task-dependent, and current output-level approaches are insufficient to broadly prevent knowledge theft.

Abstract

Knowledge distillation from proprietary LLM APIs poses a growing threat to model providers, yet defenses against this attack remain fragmented and unevaluated. We present DistillGuard, a framework for systematically evaluating output-level defenses against LLM knowledge distillation. We introduce a taxonomy of three defense categories -- output perturbation, data poisoning, and information throttling -- and evaluate nine defense configurations using a standardized pipeline with Qwen3-14B as teacher and Qwen2.5-7B-Instruct as student across three benchmarks (MATH-500, HumanEval+, MT-Bench). Our results reveal that, in a same-family distillation setting against a naive attacker, most output-level defenses are surprisingly ineffective: paraphrasing-based perturbation barely degrades distilled student quality, and data poisoning primarily impairs conversational fluency while leaving task-specific capabilities intact. Only chain-of-thought removal substantially impairs mathematical reasoning (31.4% vs. 67.8% baseline), though code generation remains unaffected. These findings demonstrate that the effectiveness of distillation defenses is highly task-dependent and that current output-level approaches are insufficient to broadly prevent knowledge theft.

Paper Structure (69 sections, 4 equations, 5 figures, 4 tables)

Introduction
A Taxonomy of Distillation Defenses
Output Perturbation
Data Poisoning
Information Throttling
Chain-of-thought (CoT) removal.
Token truncation.
Excluded Defense Categories
Evaluation Framework
Threat Model
Model provider.
Attacker.
Scope.
The ProtectedAPI Abstraction
Distillation Pipeline
...and 54 more sections

Figures (5)

Figure 1: Taxonomy of output-level distillation defenses evaluated in this work. Each category targets a different mechanism: perturbation corrupts the signal, poisoning introduces adversarial errors, and throttling restricts information content.
Figure 2: Perturbation strength vs. student quality. Dashed lines show the baseline. No consistent degradation occurs at any perturbation level; $\alpha = 1.0$ slightly improves MATH performance.
Figure 3: Poisoning rate vs. student quality. MT-Bench (right axis) degrades monotonically while MATH and HumanEval+ remain stable. Dashed lines show baselines.
Figure 4: Average DE by defense category and benchmark. Each category primarily affects a different capability: poisoning targets conversation, throttling targets reasoning, perturbation affects none.
Figure 5: DE--DC trade-off for all nine defenses. The ideal defense (low DE, low DC) would occupy the bottom-left corner. Instead, all defenses either cluster near the top (high DE $\approx$ no protection) with low DC, or achieve lower DE only at substantial DC cost (A08). No defense reaches the ideal region.
