Can LLM Watermarks Robustly Prevent Unauthorized Knowledge Distillation?
Leyi Pan, Aiwei Liu, Shiyu Huang, Yijian Lu, Xuming Hu, Lijie Wen, Irwin King, Philip S. Yu
TL;DR
This work examines whether LLM watermarking remains robust against adversarial removal aimed at enabling unauthorized knowledge distillation. It introduces three removal mechanisms: pre-distillation Untargeted and Targeted Paraphrasing (UP and TP) and post-distillation Watermark Neutralization (WN), together with a watermark-stealing framework that infers the watermarking rules without exact access to the scheme ($n$-gram window size and inverse strength $\delta'$). Empirical results show that TP and WN thoroughly remove inherited watermarks, with WN also preserving knowledge transfer at low computational overhead, which challenges current watermarking defenses. The findings underscore the need for more robust and diverse defenses, particularly in multi-source and latency-aware settings, before watermarking can reliably deter unauthorized knowledge distillation in production LLM systems. Limitations include a single teacher model and English-language tasks, suggesting future work on broader architectures, data scales, and domain coverage.
Abstract
The radioactive nature of Large Language Model (LLM) watermarking enables the detection of watermarks inherited by student models when trained on the outputs of watermarked teacher models, making it a promising tool for preventing unauthorized knowledge distillation. However, the robustness of watermark radioactivity against adversarial actors remains largely unexplored. In this paper, we investigate whether student models can acquire the capabilities of teacher models through knowledge distillation while avoiding watermark inheritance. We propose two categories of watermark removal approaches: pre-distillation removal through untargeted and targeted training data paraphrasing (UP and TP), and post-distillation removal through inference-time watermark neutralization (WN). Extensive experiments across multiple model pairs, watermarking schemes and hyper-parameter settings demonstrate that both TP and WN thoroughly eliminate inherited watermarks, with WN achieving this while maintaining knowledge transfer efficiency and low computational overhead. Given the ongoing deployment of watermarking techniques in production LLMs, these findings emphasize the urgent need for more robust defense strategies. Our code is available at https://github.com/THU-BPM/Watermark-Radioactivity-Attack.
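To make the idea of inference-time watermark neutralization concrete, the following is a minimal sketch and not the authors' implementation (see the linked repository for that). It assumes, purely for illustration, that the inherited watermark behaves like an additive bias on the logits of tokens favored after a given $n$-gram prefix, and that an attacker has already estimated those favored tokens. The names `StolenRules`, `neutralize_logits`, `window`, and `delta_prime` are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple

# Hypothetical container: maps an n-gram prefix (tuple of token ids) to the
# token ids an attacker suspects the teacher's watermark favors after it.
StolenRules = Dict[Tuple[int, ...], Set[int]]


def neutralize_logits(
    logits: Sequence[float],
    context: Sequence[int],
    stolen_rules: StolenRules,
    window: int,
    delta_prime: float,
) -> List[float]:
    """Return a copy of `logits` with suspected watermark-favored tokens penalized.

    Assumption: the watermark acts as an additive logit bias, so subtracting
    an inverse strength `delta_prime` from suspected tokens counteracts it.
    """
    # Use the last `window` tokens of the context as the lookup prefix.
    start = max(0, len(context) - window)
    prefix = tuple(context[start:])
    suspected = stolen_rules.get(prefix, set())
    return [
        value - delta_prime if token_id in suspected else value
        for token_id, value in enumerate(logits)
    ]


# Toy example: a 4-token vocabulary where tokens 1 and 3 carry an inherited
# bias of 2.0 whenever the previous token is 7.
clean = [2.0, 1.0, 1.0, 0.0]
favored = {1, 3}
inherited = [v + 2.0 if i in favored else v for i, v in enumerate(clean)]
rules = {(7,): favored}
restored = neutralize_logits(
    inherited, context=[5, 7], stolen_rules=rules, window=1, delta_prime=2.0
)
```

In this toy setting the penalty exactly cancels the bias only because the stolen rules and the strength are assumed to be perfectly estimated. With imperfect estimates the correction would be partial, and prefixes absent from `stolen_rules` leave the logits unchanged.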
