No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices

Qi Pang; Shengyuan Hu; Wenting Zheng; Virginia Smith

TL;DR

The paper shows that common design choices in LLM watermarking schemes make the resulting systems surprisingly susceptible to attack, leading to fundamental trade-offs in robustness, utility, and usability.

Abstract

Advances in generative models have made it possible for AI-generated text, code, and images to mirror human-generated content in many applications. Watermarking, a technique that aims to embed information in the output of a model to verify its source, is useful for mitigating the misuse of such AI-generated content. However, we show that common design choices in LLM watermarking schemes make the resulting systems surprisingly susceptible to attack -- leading to fundamental trade-offs in robustness, utility, and usability. To navigate these trade-offs, we rigorously study a set of simple yet effective attacks on common watermarking systems, and propose guidelines and defenses for LLM watermarking in practice.

Paper Structure (30 sections, 3 theorems, 28 equations, 26 figures, 3 tables)

Introduction
Related Work
Preliminaries
Threat Model
Attacking Robust Watermarks
Evaluation
Discussion
Attacking Stealing-Resistant Watermarks
Evaluation
Discussion
Attacking Watermark Detection APIs
Attack Procedures
Evaluation
Defending Detection with Differential Privacy
Discussion
...and 15 more sections

Key Result

Theorem 1

Consider a watermarked token sequence $\textbf{x}$ of length $l$. The Unigram watermark z-score threshold is $T$, the portion of the tokens in the green list is $\gamma$, the detection z-score of $\textbf{x}$ is $z$, and the number of inserted tokens is $s$. Then, to guarantee the expected z-score o
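The quantities named in Theorem 1 can be related numerically with the standard one-proportion z-score used by green-list watermarks. The sketch below is not the theorem's bound, which is truncated here. It assumes, purely for illustration, that each inserted token lands in the green list with probability $\gamma$ (that is, inserted tokens are independent of the watermark key). The function names `z_score`, `expected_z_after_insertion`, and `max_insertions` are hypothetical helpers, not part of any published code.

```python
import math


def z_score(green_count: float, length: int, gamma: float) -> float:
    """One-proportion z-score: how far the green-token count sits above gamma * length."""
    return (green_count - gamma * length) / math.sqrt(length * gamma * (1 - gamma))


def expected_z_after_insertion(z: float, length: int, inserted: int, gamma: float) -> float:
    """Expected z-score after inserting `inserted` tokens into a sequence of `length` tokens.

    ASSUMPTION (illustrative only): every inserted token is green with
    probability gamma, independently of the watermark.
    """
    # Recover the green-token count implied by the original z-score.
    green = gamma * length + z * math.sqrt(length * gamma * (1 - gamma))
    # Under the assumption, inserted tokens add gamma * inserted green tokens in expectation.
    expected_green = green + gamma * inserted
    return z_score(expected_green, length + inserted, gamma)


def max_insertions(z: float, length: int, gamma: float, threshold: float,
                   limit: int = 1_000_000) -> int:
    """Largest s (up to `limit`) whose expected z-score stays at or above `threshold`."""
    s = 0
    while s < limit and expected_z_after_insertion(z, length, s + 1, gamma) >= threshold:
        s += 1
    return s
```

Under this assumption the expected z-score reduces to $z\sqrt{l/(l+s)}$, so it decays slowly as tokens are inserted. Calling `max_insertions(8.0, 200, 0.5, 4.5)` shows how many insertions a strongly watermarked 200-token sequence could absorb before its expected z-score drops below a threshold of 4.5.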

Figures (26)

Figure 1: Toxic token insertion.
Figure 2: Fluent inaccurate editing.
Figure 4: Spoofing attack based on watermark stealing (Jovanović et al., 2024) and watermark-removal attacks on the KGW watermark and the LLAMA-2-7B model with different numbers of watermark keys $n$. A higher z-score reflects more confidence that the text is watermarked, and a lower perplexity indicates better sentence quality. The attack success rates are based on the threshold at FPR@1e-3.
Figure 5: Attacks exploiting detection APIs on LLAMA-2-7B model.
Figure 6: Spoofing ASR and detection ACC.
...and 21 more figures

Theorems & Definitions (9)

Definition 1: LM
Definition 2: Watermarked LLMs
Definition 3: Watermark robustness
Theorem 1: Maximum insertion portion
proof
Theorem 2: Probability bound of unwatermarked token estimation
proof
Theorem 3: Probability bound of unwatermarked token estimation for Exp
proof
