Protecting Language Models Against Unauthorized Distillation through Trace Rewriting

Xinhang Ma; William Yeoh; Ning Zhang; Yevgeniy Vorobeychik

TL;DR

This paper investigates trace-level defenses that deter unauthorized distillation of frontier LLMs. It pursues two complementary objectives, anti-distillation and API watermarking, through instruction-based and gradient-based rewriting of teacher reasoning traces; the rewrites preserve semantics while either degrading the traces' usefulness for downstream training or embedding verifiable signatures. Empirical results show state-of-the-art anti-distillation effects (up to a 61.3% reduction in student accuracy) with minimal impact on the teacher, and highly reliable watermark detection with near-zero false alarms under various distillation and filtering scenarios. The work presents a practical, trace-centric framework for model protection, with implications for licensing, ownership, and security in real-world deployments.

Abstract

Knowledge distillation is a widely adopted technique for transferring capabilities from LLMs to smaller, more efficient student models. However, unauthorized use of knowledge distillation takes unfair advantage of the considerable effort and cost put into developing frontier models. We investigate methods for modifying teacher-generated reasoning traces to achieve two objectives that deter unauthorized distillation: (1) anti-distillation, or degrading the training usefulness of query responses, and (2) API watermarking, which embeds verifiable signatures in student models. We introduce several approaches for dynamically rewriting a teacher's reasoning outputs while preserving answer correctness and semantic coherence. Two of these leverage the rewriting capabilities of LLMs, while others use gradient-based techniques. Our experiments show that a simple instruction-based rewriting approach achieves a strong anti-distillation effect while maintaining or even improving teacher performance. Furthermore, we show that our rewriting approach also enables highly reliable watermark detection with essentially no false alarms.
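
The two-stage flow that the abstract describes (the teacher produces a clean trace, then a rewrite model transforms it before it is returned) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' implementation: the `TextModel` type, the `extract_answer` helper, the "Answer:" marker convention, and the fallback to the clean trace when the final answer changes are all assumptions introduced here to make the "preserving answer correctness" constraint concrete.

```python
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to generated text.
TextModel = Callable[[str], str]


def extract_answer(trace: str) -> str:
    """Assumed convention: the final answer follows the last 'Answer:' marker."""
    return trace.rsplit("Answer:", 1)[-1].strip()


def rewrite_trace(
    query: str,
    teacher: TextModel,
    rewriter: TextModel,
    p_g: str,
    p_r: str,
) -> str:
    """Return a rewritten reasoning trace r' for `query`.

    p_g is the generation instruction and p_r the rewrite instruction.
    """
    # Stage (a): clean trace generation by the teacher.
    r = teacher(f"{p_g}\n\n{query}")
    # Stage (b): the rewrite model transforms r into r'.
    r_prime = rewriter(f"{p_r}\n\n{r}")
    # Assumed safeguard (not from the paper): if the rewrite changed the
    # final answer, fall back to the clean trace.
    if extract_answer(r_prime) != extract_answer(r):
        return r
    return r_prime


# Toy stand-ins for real models, for illustration only.
def toy_teacher(prompt: str) -> str:
    return "2 + 2 = 4. Answer: 4"


def toy_rewriter(prompt: str) -> str:
    return "Summing the two twos yields four. Answer: 4"


def bad_rewriter(prompt: str) -> str:
    return "Summing the two twos yields five. Answer: 5"
```

In a real deployment the two callables would wrap actual model calls, and the correctness check would likely need to be more robust than a string comparison on the final answer.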

Paper Structure (43 sections, 11 equations, 15 figures, 1 table)

Introduction
Related Work
Preliminaries
LLMs and Reasoning
Knowledge Distillation
Model
Problem Setting
Anti-Distillation
API Watermarking
Constraints on Rewriting
Methodology
Instruction-Based Rewriting
Semantic Prompting
Optimized Prompting
Gradient-Based Rewriting
...and 28 more sections

Figures (15)

Figure 1: Overview of instruction-based rewriting. (a) Clean trace generation: The teacher model $\mathcal{T}$ generates a reasoning trace $r$ for a given task (query) $q$ using a standard generation instruction $p_g$. (b) Rewriting: A rewrite model $\mathcal{R}$ with a rewrite instruction $p_r$ transforms $r$ into $r'$ to achieve IP protection while maintaining utility.
Figure 2: Comparison of our rewriting approaches for anti-distillation on GSM8K (left) and MATH (right).
Figure 3: Anti-distillation comparisons on GSM8K (left) and MATH (right). Our method achieves the strongest anti-distillation effect without compromising the teacher's utility.
Figure 4: Watermark detection: true detection and false alarm rates vs. K for the llama3.1-8B suspect student model.
Figure 5: Anti-distillation effects of the Token-Level poisoning method, where FO-Grad is the adversarial approximation of the actual objective, similar to how they are defined in Section \ref{sec:method:subsec:gradient_rewriting}.
...and 10 more figures
