Turning Your Strength into Watermark: Watermarking Large Language Model via Knowledge Injection

Shuai Li; Kejiang Chen; Kunsheng Tang; Jie Zhang; Weiming Zhang; Nenghai Yu; Kai Zeng

TL;DR

This work addresses copyright protection for LLM APIs and open-source LLMs by proposing a knowledge-injection watermarking framework that uses encoded knowledge as the watermark carrier. Watermarks are embedded into selected knowledge as ASCII-encoded tokens and injected through LoRA-based fine-tuning, so the watermark can be extracted with targeted prompts in a black-box setting. Empirical results show a watermark extraction success rate (ESR) close to 1 across multiple models, with high fidelity, stealthiness, and robustness against fine-tuning, merging, and quantization attacks, outperforming backdoor-based baselines. By shifting the watermark from generated text to embedded knowledge, the method offers a covert, scalable approach to watermarking LLMs and enables practical copyright verification and traceability for both APIs and open-source models.

Abstract

Large language models (LLMs) have demonstrated outstanding performance, making them valuable digital assets with significant commercial potential. Unfortunately, both an LLM and its API are susceptible to intellectual property theft. Watermarking is a classic solution for copyright verification. However, most recent LLM watermarking methods focus on identifying AI-generated text rather than watermarking the LLM itself. Only a few attempts are based on weight quantization and backdoor watermarking, and these are not robust or covert enough, which limits their applicability in practice. To address this issue, we propose a novel watermarking method for LLMs based on knowledge injection, using knowledge as the watermark carrier. Specifically, in the watermark embedding stage, we first embed the watermark into selected knowledge to obtain watermarked knowledge, which is then injected into the LLM to be protected. In the watermark extraction stage, we design questions related to the watermarked knowledge, query the suspect LLM with them, and extract the watermark from its responses. Experiments show that the watermark extraction success rate is close to 100% and demonstrate the effectiveness, fidelity, stealthiness, and robustness of the proposed method.
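The two stages described in the abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the `lookup_table` carrier function, the question wording, the regex-based parser, and the definition of the success rate as the fraction of responses from which the exact watermark is recovered. The model itself is replaced by hand-written responses, so the sketch shows only the data flow, not the fine-tuning step.

```python
from __future__ import annotations

import re


def encode_watermark(watermark: str) -> list[int]:
    """Encode each character of the watermark as its ASCII code."""
    return [ord(ch) for ch in watermark]


def build_watermarked_sample(watermark: str) -> dict[str, str]:
    """Embedding stage: hide the encoded watermark inside a piece of code knowledge.

    The function name and question text are illustrative assumptions.
    """
    codes = encode_watermark(watermark)
    answer = (
        "def lookup_table():\n"
        f"    values = {codes}\n"
        "    return values\n"
    )
    question = "Write the Python function lookup_table."
    return {"question": question, "answer": answer}


def extract_watermark(response: str) -> str | None:
    """Extraction stage: parse the first integer list in a response and decode it."""
    match = re.search(r"\[([\d,\s]+)\]", response)
    if match is None:
        return None
    codes = [int(tok) for tok in match.group(1).split(",") if tok.strip()]
    return "".join(chr(code) for code in codes)


def extraction_success_rate(responses: list[str], watermark: str) -> float:
    """Assumed metric: share of responses that yield the exact watermark."""
    if not responses:
        return 0.0
    hits = sum(extract_watermark(r) == watermark for r in responses)
    return hits / len(responses)


# Simulated suspect-model responses: one reproduces the injected knowledge,
# the other does not contain it.
sample = build_watermarked_sample("Watermark")
responses = [sample["answer"], "I do not know that function."]
print(extraction_success_rate(responses, "Watermark"))
```

In an actual deployment, the samples produced in the embedding stage would be used as fine-tuning data, and the responses would come from querying the suspect LLM with the associated questions.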

Paper Structure (25 sections, 13 equations, 6 figures, 14 tables)

Introduction
Related Work
Knowledge Injection
Deep Model Watermarking
Large Language Model Watermarking
Preliminary
The Definition of Technical Terms
Threat Model
Methodology
Watermark Injection
Watermark Extraction
Experiment
Experiment Setting
Effectiveness
Fidelity
...and 10 more sections

Figures (6)

Figure 1: The framework of the watermarking method via knowledge injection. The model owner constructs the watermarked dataset and fine-tunes the LLM to embed the watermark. When an attacker copies the watermarked LLM and deploys it without authorization, the model owner can extract the watermark by querying the model with questions related to the watermarked knowledge.
Figure 2: Examples of watermarked knowledge. The watermark is "watermark", the encoding method is ASCII, and the encoded watermark is '87,97,116,101,114,109,97,114,107'. The encoded watermark is embedded in a list, set, or string inside Python functions.
Figure 3: The watermark extraction success rate of Baize-7b-v2, LLaMA-7b and Vicuna-7b-v1.5 under different watermark ratios. The external datasets are Code and Dolly.
Figure 4: The watermark extraction success rate under different watermark capacities.
Figure 5: An inference case of a backdoor-based watermarked LLM.
...and 1 more figure
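The ASCII encoding mentioned in the Figure 2 caption can be checked with a short sketch. The helper names `encode_ascii` and `decode_ascii` are hypothetical and not from the paper. Note that 87 is the ASCII code for uppercase "W" (lowercase "w" is 119), so the code sequence in the caption decodes to "Watermark" with a capital first letter.

```python
def encode_ascii(text: str) -> str:
    """Encode text as comma-separated ASCII codes, one per character."""
    return ",".join(str(ord(ch)) for ch in text)


def decode_ascii(encoded: str) -> str:
    """Decode a comma-separated list of ASCII codes back into text."""
    return "".join(chr(int(tok)) for tok in encoded.split(","))


# Code sequence quoted in the Figure 2 caption.
figure_2_codes = "87,97,116,101,114,109,97,114,107"

print(decode_ascii(figure_2_codes))   # decodes the caption's sequence
print(encode_ascii("watermark"))      # all-lowercase spelling, for comparison
```

Because the encoding is a plain character-to-integer mapping, any watermark string over the ASCII range round-trips exactly through these two helpers.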

Theorems & Definitions (1)

proof
