Turning Your Strength into Watermark: Watermarking Large Language Model via Knowledge Injection
Shuai Li, Kejiang Chen, Kunsheng Tang, Jie Zhang, Weiming Zhang, Nenghai Yu, Kai Zeng
TL;DR
This work addresses copyright protection for LLM APIs and open-source LLMs by proposing a knowledge-injection watermarking framework in which encoded knowledge serves as the watermark carrier. Watermarks are embedded into selected knowledge through ASCII-encoded tokens, and the watermarked knowledge is injected via LoRA-based fine-tuning, so the watermark can be extracted with targeted prompts in a black-box setting. Empirical results show a watermark extraction success rate (ESR) close to 1 across multiple models, with high fidelity, stealthiness, and robustness against fine-tuning, merging, and quantization attacks, outperforming backdoor-based baselines. By shifting the watermark from generated text to embedded knowledge, the method offers a covert, scalable way to watermark LLMs and enables practical copyright verification and traceability for both APIs and open-source models.
Abstract
Large language models (LLMs) have demonstrated outstanding performance, making them valuable digital assets with significant commercial potential. Unfortunately, LLMs and their APIs are susceptible to intellectual property theft. Watermarking is a classic solution for copyright verification. However, most recently proposed LLM watermarking methods focus on identifying AI-generated text rather than watermarking the LLM itself. Only a few attempts are based on weight quantization and backdoor watermarking, and these are not robust or covert enough, which limits their applicability in practice. To address this issue, we propose a novel watermarking method for LLMs based on knowledge injection, which uses knowledge as the watermark carrier. Specifically, in the watermark embedding stage, we first embed the watermarks into the selected knowledge to obtain the watermarked knowledge, which is subsequently injected into the to-be-protected LLM. In the watermark extraction stage, questions related to the watermarked knowledge are designed to query the suspect LLM, and the watermarks are extracted from its responses. Experiments show that the watermark extraction success rate is close to 100% and demonstrate the effectiveness, fidelity, stealthiness, and robustness of the proposed method.
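The embed-then-extract flow described above can be illustrated with a toy sketch. This is not the authors' implementation: the `[ref:...]` carrier format, the function names, and the dictionary-backed stand-in models are all assumptions made for illustration, and the actual injection step (LoRA fine-tuning) is replaced by a simple lookup. The sketch only shows how an ASCII-encoded watermark could ride on a piece of knowledge and how an extraction success rate could be computed from black-box responses.

```python
import re
from typing import Callable, Dict, List, Optional


def ascii_encode(watermark: str) -> str:
    """Encode an ASCII watermark as a bit string, 8 bits per character."""
    if not watermark.isascii():
        raise ValueError("watermark must contain only ASCII characters")
    return "".join(format(ord(ch), "08b") for ch in watermark)


def ascii_decode(bits: str) -> str:
    """Invert ascii_encode: read the bit string back 8 bits at a time."""
    return "".join(chr(int(bits[i:i + 8], 2)) for i in range(0, len(bits), 8))


def embed_watermark(knowledge: Dict[str, str], watermark: str) -> Dict[str, str]:
    """Attach the encoded watermark to every answer (hypothetical carrier format)."""
    tag = ascii_encode(watermark)
    return {q: f"{a} [ref:{tag}]" for q, a in knowledge.items()}


def extract_watermark(response: str) -> Optional[str]:
    """Recover the watermark from a response, or None if no carrier is present."""
    match = re.search(r"\[ref:((?:[01]{8})+)\]", response)
    return ascii_decode(match.group(1)) if match else None


def extraction_success_rate(
    model: Callable[[str], str], questions: List[str], watermark: str
) -> float:
    """Fraction of queries whose response yields exactly the expected watermark."""
    hits = sum(extract_watermark(model(q)) == watermark for q in questions)
    return hits / len(questions)


# Toy knowledge base; in the paper the knowledge is injected by fine-tuning,
# which is simulated here with plain dictionary lookups (an assumption).
knowledge = {
    "What does helper_a return?": "It returns the input doubled.",
    "What does helper_b return?": "It returns the input squared.",
}
injected = embed_watermark(knowledge, "WM")


def watermarked_model(question: str) -> str:
    return injected.get(question, "I do not know.")


def clean_model(question: str) -> str:
    return knowledge.get(question, "I do not know.")


questions = list(knowledge)
esr_watermarked = extraction_success_rate(watermarked_model, questions, "WM")
esr_clean = extraction_success_rate(clean_model, questions, "WM")
```

In this toy setting the verifier needs only query access to the suspect model, which mirrors the black-box extraction setting; a real carrier would have to be far less conspicuous than a visible tag to meet the stealthiness goal.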
