ModelShield: Adaptive and Robust Watermark against Model Extraction Attack

Kaiyi Pang; Tao Qi; Chuhan Wu; Minhao Bai; Minghu Jiang; Yongfeng Huang

ModelShield: Adaptive and Robust Watermark against Model Extraction Attack

Kaiyi Pang, Tao Qi, Chuhan Wu, Minhao Bai, Minghu Jiang, Yongfeng Huang

TL;DR

This paper tackles IP protection for large language models against model extraction by introducing ModelShield, a plug‑and‑play, adaptive watermarking framework that uses system prompts to invisibly embed watermark words in generated content without retraining. It couples this self‑watermarking with a robust two‑tier infringement detection mechanism: rapid verification using Sentence Watermark Scores and a t‑test against a threshold derived from human text, plus detailed KS‑test verification for deeper comparison, ensuring robustness against adversarial attacks. Empirical evaluations on HC3 and WILD datasets with ChatGPT as the victim and GPT‑2 Large, Llama2, and Mistral as imitators demonstrate that watermarks are learnable by imitation models and detectable with high sensitivity, while preserving QA performance and text quality (significant degradation avoided). The results show strong generalization, efficiency (low watermark token overhead), and resilience to prompt injections and data mixtures, indicating practical applicability for LMaaS providers to protect model IP at low cost and with minimal disruption to users.

Abstract

Large language models (LLMs) demonstrate general intelligence across a variety of machine learning tasks, thereby enhancing the commercial value of their intellectual property (IP). To protect this IP, model owners typically allow user access only in a black-box manner, however, adversaries can still utilize model extraction attacks to steal the model intelligence encoded in model generation. Watermarking technology offers a promising solution for defending against such attacks by embedding unique identifiers into the model-generated content. However, existing watermarking methods often compromise the quality of generated content due to heuristic alterations and lack robust mechanisms to counteract adversarial strategies, thus limiting their practicality in real-world scenarios. In this paper, we introduce an adaptive and robust watermarking method (named ModelShield) to protect the IP of LLMs. Our method incorporates a self-watermarking mechanism that allows LLMs to autonomously insert watermarks into their generated content to avoid the degradation of model content. We also propose a robust watermark detection mechanism capable of effectively identifying watermark signals under the interference of varying adversarial strategies. Besides, ModelShield is a plug-and-play method that does not require additional model training, enhancing its applicability in LLM deployments. Extensive evaluations on two real-world datasets and three LLMs demonstrate that our method surpasses existing methods in terms of defense effectiveness and robustness while significantly reducing the degradation of watermarking on the model-generated content.

ModelShield: Adaptive and Robust Watermark against Model Extraction Attack

TL;DR

Abstract

Paper Structure (33 sections, 7 equations, 11 figures, 12 tables)

This paper contains 33 sections, 7 equations, 11 figures, 12 tables.

Introduction
Related Works
Model Extraction Attack
Types of APIs
Defense against Model Extraction Attack
Real-time defenses
Post-attack defenses
Language model watermarking method against model extraction attack
Methodology
Preliminary
The basic property of language model IP protection watermarks
Adaptive Self-Watermarking Mechanism
Robust IP Infringement Detection
Experiments and Analysis
Self-Generated Watermark Datasets
...and 18 more sections

Figures (11)

Figure 1: Language models offering services are at risk of model extraction attacks, which unfold in three steps, depicted in the diagram with blue boxes. Attackers first collect data output by the victim model through queries, then use these data to train their own imitation models. Eventually, the imitation model achieves performance comparable to the victim model, significantly endangering the IP of the victim model. IP protection watermarking involves embedding a special watermark signal in the output of the victim model. When watermarked data are used to train an imitation model, the watermark signal can still be detected in the model's output. The stages of watermark generation and extraction are highlighted in yellow boxes in the diagram.
Figure 2: The workflow of our watermarking method. Our watermarking method involves two main processes: embedding and extraction, indicated by blue and yellow frames, respectively. In the embedding phase, a user's query is combined with an automatic watermark generation instruction as input to the victim model, which then produces watermarked text outputs. These outputs can be used by malicious users to train imitation models; To verify if suspect models were trained using the victim model's outputs, we analyze the query history between the suspect model owner and the victim model and conduct watermark detection on the suspect model's text outputs. If a t-test on the top 1% of data with high watermark scores results in a $p$-value below 0.05, we conclude that the suspect model is an imitation trained with watermarked data.
Figure 3: Main Experiment Results: We trained imitation models ($M_I$) using 4000 watermarked data based on three different foundational models, repeating the process ten times. The results demonstrate the average sentence watermark score from all texts generated by the imitation models, as well as their question-answering performance. Compared to outputs from the original foundational models ($M_O$) and models trained with normal data ($M_L$), the watermark scores of the imitation models were significantly higher, while their question-answering performance remained comparable to that of normal models trained without watermarked data ($M_L$).
Figure 4: Cumulative Distribution Function (CDF) of Sentence Watermark Scores: This graph shows the CDF of 4000 sentence watermark scores of 4000 samples for $M_I$, $M_L$, and $M_O$. It clearly illustrates that the distribution of the Imitation model is significantly distinct from the other two types.
Figure 5: Different watermark strategy. We tested the average sentence watermark score of three base models (Gpt2, Llama2, and Mistral) on the HC3 and WILD datasets.
...and 6 more figures

ModelShield: Adaptive and Robust Watermark against Model Extraction Attack

TL;DR

Abstract

ModelShield: Adaptive and Robust Watermark against Model Extraction Attack

Authors

TL;DR

Abstract

Table of Contents

Figures (11)