No Free Lunch in LLM Watermarking: Trade-offs in Watermarking Design Choices

Qi Pang, Shengyuan Hu, Wenting Zheng, Virginia Smith

TL;DR

Common design choices in LLM watermarking schemes make the resulting systems surprisingly susceptible to attack, leading to fundamental trade-offs in robustness, utility, and usability.

Abstract

Advances in generative models have made it possible for AI-generated text, code, and images to mirror human-generated content in many applications. Watermarking, a technique that aims to embed information in the output of a model to verify its source, is useful for mitigating the misuse of such AI-generated content. However, we show that common design choices in LLM watermarking schemes make the resulting systems surprisingly susceptible to attack -- leading to fundamental trade-offs in robustness, utility, and usability. To navigate these trade-offs, we rigorously study a set of simple yet effective attacks on common watermarking systems, and propose guidelines and defenses for LLM watermarking in practice.

Paper Structure (30 sections, 3 theorems, 28 equations, 26 figures, 3 tables)

Key Result

Theorem 1

Consider a watermarked token sequence $\textbf{x}$ of length $l$. The Unigram watermark z-score threshold is $T$, the portion of the tokens in the green list is $\gamma$, the detection z-score of $\textbf{x}$ is $z$, and the number of inserted tokens is $s$. Then, to guarantee the expected z-score o
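The theorem statement is cut off above, so the following is not the paper's bound. It is a minimal sketch of the quantities the theorem names, assuming the standard green-list z-score $z = (g - \gamma l) / \sqrt{\gamma (1-\gamma) l}$, where $g$ is the number of green tokens, and assuming (an assumption of this sketch) that each inserted token falls in the green list independently with probability $\gamma$. Under that assumption the expected z-score after inserting $s$ tokens is $z \sqrt{l / (l + s)}$. The function names are hypothetical and chosen for illustration only.

```python
import math


def z_score(green_count: int, length: int, gamma: float) -> float:
    """Standard green-list z-score: (g - gamma*l) / sqrt(gamma*(1-gamma)*l)."""
    return (green_count - gamma * length) / math.sqrt(gamma * (1.0 - gamma) * length)


def expected_z_after_insertion(z: float, length: int, inserted: int) -> float:
    """Expected z-score after inserting `inserted` tokens.

    Assumption of this sketch: every inserted token is green independently
    with probability gamma, so the expected green-token excess is unchanged
    while the sequence length grows from l to l + s.
    """
    return z * math.sqrt(length / (length + inserted))


def max_insertions(z: float, length: int, threshold: float) -> int:
    """Largest s for which the expected z-score stays at or above the threshold.

    Solves z * sqrt(l / (l + s)) >= T for s, under the same assumption.
    """
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if z < threshold:
        return 0  # already below the detection threshold before any insertion
    return math.floor(length * (z * z / (threshold * threshold) - 1.0))
```

For example, with $z = 6$, $l = 200$, and $T = 4$, `max_insertions(6.0, 200, 4.0)` gives 250 under these assumptions: the expected z-score decays only with the square root of the length ratio, so a strongly watermarked sequence tolerates more inserted tokens than it originally contained.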

Figures (26)

  • Figure 1: Toxic token insertion.
  • Figure 2: Fluent inaccurate editing.
  • Figure 4: Spoofing attack based on watermark stealing (Jovanović et al., 2024) and watermark-removal attacks on the KGW watermark and the LLAMA-2-7B model with different numbers of watermark keys $n$. A higher z-score reflects more confidence that the text is watermarked, and a lower perplexity indicates better sentence quality. The attack success rates are based on the detection threshold at a false positive rate (FPR) of 1e-3.
  • Figure 5: Attacks exploiting detection APIs on LLAMA-2-7B model.
  • Figure 6: Spoofing ASR and detection ACC.
  • ...and 21 more figures

Theorems & Definitions (9)

  • Definition 1: LM
  • Definition 2: Watermarked LLMs
  • Definition 3: Watermark robustness
  • Theorem 1: Maximum insertion portion
  • proof
  • Theorem 2: Probability bound of unwatermarked token estimation
  • proof
  • Theorem 3: Probability bound of unwatermarked token estimation for Exp
  • proof