DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation

Pingzhi Li, Zhen Tan, Mohan Zhang, Huaizhi Qu, Huan Liu, Tianlong Chen

TL;DR

This work tackles protecting proprietary LLMs from knowledge distillation when access is limited to API outputs. It introduces DOGe, a defense that adversarially tunes only the final LM head to subtly alter intermediate reasoning patterns, making distillation from outputs difficult while preserving the teacher’s utility. Through a dual-objective loss combining standard fine-tuning with a KL-divergence-based adversarial term and a reasoning-aware mask, DOGe degrades the performance of student models distilled from defended teachers by up to roughly 5× across multiple datasets, with the defender’s own performance either maintained or improved. The approach is parameter-efficient, generalizes across domains, and provides a practical pathway for IP protection of LLMs in real-world API-based scenarios.
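The dual objective described above can be sketched as follows. This is only an illustrative form under stated assumptions: $\theta_{final}$ and $\lambda$ appear in the source, while the token mask $m_t$, the teacher distribution $p_{\theta_{final}}$, the proxy-student distribution $q_{\text{proxy}}$, the sequence length $T$, and the direction of the KL term are notation introduced here. The paper's exact equations may differ.

$$
\mathcal{L}(\theta_{final}) \;=\; \mathcal{L}_{\text{SFT}}(\theta_{final}) \;-\; \lambda \cdot \frac{1}{T} \sum_{t=1}^{T} m_t \, \mathrm{KL}\!\left( p_{\theta_{final}}(\cdot \mid x, y_{<t}) \,\big\|\, q_{\text{proxy}}(\cdot \mid x, y_{<t}) \right)
$$

Under this reading, $m_t = 1$ on intermediate reasoning tokens and $m_t = 0$ on final-answer tokens, so minimizing $\mathcal{L}$ keeps the answer accurate while pushing the teacher's reasoning-token distribution away from what a proxy student would predict.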

Abstract

Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post-hoc, while other defenses assume the student model mimics the teacher's internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs are accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates and disrupts distillation attempts during inference time. Our experiments show that, while the teacher model's performance is preserved, student models distilled from the defensively generated outputs suffer catastrophically reduced performance, establishing DOGe as a practical safeguard against KD-based model imitation.
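To make "fine-tuning only the final linear layer with an adversarial loss" concrete, here is a minimal pure-Python sketch of one gradient step on a toy LM head. It is not the authors' implementation: the function names (`defensive_loss`, `head_gradient`, `sgd_step`), the toy weights, the proxy distributions, the KL direction, and the per-token averaging are all assumptions chosen for illustration. The hidden states `h` stand in for the frozen backbone; only the head matrix `W` is updated.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def head_logits(W, h):
    """Final linear layer: one logit per vocabulary row of W."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

def defensive_loss(W, batch, lam):
    """Mean over tokens of: cross-entropy - lam * mask * KL(teacher || proxy)."""
    total = 0.0
    for h, target, proxy, is_reasoning in batch:
        p = softmax(head_logits(W, h))
        total += -math.log(p[target])          # standard fine-tuning term
        if is_reasoning:                       # assumed reasoning-aware mask
            total -= lam * kl(p, proxy)        # adversarial term (maximize KL)
    return total / len(batch)

def head_gradient(W, batch, lam):
    """Analytic gradient of defensive_loss with respect to W only."""
    grad = [[0.0] * len(W[0]) for _ in W]
    for h, target, proxy, is_reasoning in batch:
        p = softmax(head_logits(W, h))
        k = kl(p, proxy) if is_reasoning else 0.0
        for j, pj in enumerate(p):
            # d(cross-entropy)/d(logit_j)
            g = pj - (1.0 if j == target else 0.0)
            if is_reasoning:
                # d(-lam * KL)/d(logit_j)
                g -= lam * pj * (math.log(pj / proxy[j]) - k)
            for d, x in enumerate(h):
                grad[j][d] += g * x / len(batch)
    return grad

def sgd_step(W, grad, lr):
    """Plain gradient descent on the head; the hidden states stay frozen."""
    return [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W, grad)]

# Toy setup (all values hypothetical): vocabulary of 3 tokens, hidden size 2.
W0 = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.2]]
batch = [
    # (frozen hidden state, target token id, proxy-student probs, is_reasoning)
    ([1.0, 0.0], 0, [0.6, 0.3, 0.1], True),
    ([0.0, 1.0], 1, [0.2, 0.5, 0.3], False),
]
lam, lr = 0.5, 0.1
W1 = sgd_step(W0, head_gradient(W0, batch, lam), lr)
print("loss before:", defensive_loss(W0, batch, lam))
print("loss after: ", defensive_loss(W1, batch, lam))
```

In this sketch the second token is treated as a final-answer token, so the proxy distribution never enters its loss. Setting `lam` to zero reduces the update to ordinary fine-tuning of the head.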

Paper Structure

This paper contains 41 sections, 4 theorems, 11 equations, 10 figures, 2 tables, 1 algorithm.

Key Result

Proposition 4.1

Given Assumption ass:proxy_representativeness, training a teacher's LM head $\theta_{final}$ by minimizing the loss in Eq. eq:combined_loss with the masking in Eq. eq:masked_gradient yields a defensive teacher $\mathcal{T}^*_{\theta_{final}}$. A student model $S \in \mathcal{S}$ distilled from …

Figures (10)

  • Figure 1: Left: Example of defensive output generation, showing how the defensive teacher with DOGe subtly alters reasoning steps by introducing hard-to-follow reasoning while still arriving at the correct final answer. Right: Performance comparison between original and defensive teachers, and between original and misled (distilled from the defensive teacher) students, showing that DOGe maintains or improves teacher performance while significantly degrading student model accuracy across 4 benchmarks. Qwen3-8B is the teacher model and Llama-3.2-1B is the student model.
  • Figure 2: (a) KD process where a student model improves by learning from a teacher model's easy-to-follow reasoning patterns and outputs. (b) Defensive Training mechanism of DOGe, which trains the teacher model's LM head using the objective that preserves task performance while maximizing KL-divergence from proxy student outputs. (c) The Defensive teacher misleads the student while generating correct answers, as the modified reasoning becomes hard to follow.
  • Figure 3: Comparative evaluation of defensive vs. original teacher models and misled vs. original student models using GSM8K (math) for defensive training. For the single proxy model used in defensive training, we employ Qwen2.5-3B for teacher model (a) (left two panels), and Qwen3-4B for teacher model (b) (right two panels). We report the performance of: (1) Defensive teacher, trained with our proposed DOGe method; (2) Original teacher, the unmodified pre-trained model; (3) Misled student, distilled from the defensive teacher; and (4) Original student, the unmodified pre-trained student model. Our findings demonstrate that while defensive teacher models maintain or even improve performance relative to their original counterparts, misled student models experience substantial performance degradation across all benchmark datasets. Results using the Tulu dataset for defensive training are given in Appendix \ref{app:tulu}, where similar trends are observed.
  • Figure 4: Varying the adversarial loss coefficient $\lambda$ with DeepSeek-R1-7B as the teacher, Llama-3.2-1B as the student, and Qwen2.5-3B as the proxy student.
  • Figure 5: Ablation studies for DOGe defensive training. (a) Impact of number of proxy models. (b) Impact of training dataset choice.
  • ...and 5 more figures

Theorems & Definitions (5)

  • Proposition 4.1: Student Performance Degradation
  • Lemma 2.1: Gradient Discrepancy Bound
  • proof
  • Proposition 2.1: One-Step Lower Bound on Expected Loss Change
  • Corollary 2.0.1: Threshold on DOGe Divergence for Non-Improvement