DOGe: Defensive Output Generation for LLM Protection Against Knowledge Distillation

Pingzhi Li; Zhen Tan; Mohan Zhang; Huaizhi Qu; Huan Liu; Tianlong Chen

TL;DR

This work addresses how to protect proprietary LLMs from knowledge distillation when attackers can access only API outputs. It introduces DOGe, a defense that adversarially tunes only the final LM head to subtly alter intermediate reasoning patterns, making the outputs hard to distill from while preserving the teacher’s utility. DOGe trains with a dual-objective loss that combines standard fine-tuning with a KL-divergence-based adversarial term and a reasoning-aware mask. Student models distilled from defended teachers perform up to roughly 5× worse across multiple datasets, while the defended teacher’s own performance is maintained or improved. The approach is parameter-efficient, generalizes across domains, and offers a practical route to IP protection for LLMs served through APIs.

Abstract

Large Language Models (LLMs) represent substantial intellectual and economic investments, yet their effectiveness can inadvertently facilitate model imitation via knowledge distillation (KD). In practical scenarios, competitors can distill proprietary LLM capabilities by simply observing publicly accessible outputs, akin to reverse-engineering a complex performance by observation alone. Existing protective methods like watermarking only identify imitation post hoc, while other defenses assume the student model mimics the teacher's internal logits, rendering them ineffective against distillation purely from observed output text. This paper confronts the challenge of actively protecting LLMs within the realistic constraints of API-based access. We introduce an effective and efficient Defensive Output Generation (DOGe) strategy that subtly modifies the output behavior of an LLM. Its outputs remain accurate and useful for legitimate users, yet are designed to be misleading for distillation, significantly undermining imitation attempts. We achieve this by fine-tuning only the final linear layer of the teacher LLM with an adversarial loss. This targeted training approach anticipates distillation attempts and disrupts them at inference time. Our experiments show that, while the teacher model's performance is preserved, student models distilled from the defensively generated outputs suffer catastrophically reduced performance, establishing DOGe as a practical safeguard against KD-based model imitation.
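As a rough illustration of the dual-objective loss mentioned in the TL;DR, the sketch below combines a standard cross-entropy (fine-tuning) term with a masked KL-divergence term computed against a proxy student. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the proxy-student logits, the weighting factor `lam`, and the choice to subtract the KL term (so that minimizing the loss pushes the teacher's distribution away from the proxy student's on reasoning tokens) are all hypothetical.

```python
import math

def log_softmax(logits):
    # Numerically stable log-softmax over one token's vocabulary logits.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def kl_divergence(p_logits, q_logits):
    # KL(p || q) between the distributions implied by two logit vectors.
    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))

def defensive_loss(teacher_logits, proxy_logits, targets, reasoning_mask, lam=1.0):
    # Assumed form: total = SFT loss - lam * mean masked KL(teacher || proxy).
    # reasoning_mask[i] is 1 for intermediate reasoning tokens, 0 otherwise
    # (for example, final-answer tokens are left out of the adversarial term).
    n = len(targets)
    sft = -sum(log_softmax(t)[y] for t, y in zip(teacher_logits, targets)) / n
    masked = [
        kl_divergence(t, s)
        for t, s, m in zip(teacher_logits, proxy_logits, reasoning_mask)
        if m
    ]
    kl = sum(masked) / len(masked) if masked else 0.0
    return sft - lam * kl, sft, kl

# Toy example: two token positions, vocabulary of size three.
teacher = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]]
proxy = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
targets = [0, 1]
mask = [1, 0]  # only the first position counts as a reasoning token
total, sft, kl = defensive_loss(teacher, proxy, targets, mask, lam=1.0)
```

In a real setup only the teacher's final linear layer would receive gradients from this loss, as the abstract states; the sketch computes the scalar loss only and leaves the optimization step out.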
