Discovering Spoofing Attempts on Language Model Watermarks

Thibaud Gloaguen; Nikola Jovanović; Robin Staab; Martin Vechev

TL;DR

The paper investigates spoofing threats to LLM watermarks and introduces a statistically principled framework for distinguishing spoofed text from genuinely ξ-watermarked text. It exploits artifacts that arise because a spoofer learns the watermark from a training dataset of watermarked text, and builds two tests (Standard and Reprompting) on a correlation-based statistic that, under appropriate assumptions, converges in distribution to a standard normal, which enables hypothesis testing with a controlled false positive rate. Experiments across multiple watermarking schemes, spoofer models, and text lengths show controlled Type I error and high power (often >90% at 1% FPR) as text length grows, pointing to a fundamental limitation of current learning-based spoofers. The tests apply to different watermarking schemes and offer a practical defense for watermark attribution; the accompanying code is released for reproducibility and further research.

Abstract

LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods to discover spoofing attempts. In this work, we for the first time propose a reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts and thus demonstrate that a watermark has been spoofed. Our experimental evaluation shows high test power across all learning-based spoofing methods, providing insights into their fundamental limitations and suggesting a way to mitigate this threat. We make all our code available at https://github.com/eth-sri/watermark-spoofing-detection .

Paper Structure (68 sections, 3 theorems, 30 equations, 15 figures, 5 tables)

Introduction
LLM watermarks
Spoofing attacks
Discovering spoofing attempts
Key contributions
Background and Related Work
LLM watermarks
LLM watermark spoofing
Spoofing defenses
Broader work on LLM watermarking
Can Spoofing Attempts Be Discovered?
Problem statement
Formalization
Artifact: dependence between the color sequence and the context
A simple example
...and 53 more sections

Key Result

Lemma 4.1

Under the cross-independence between $X$ and $Y$, and technical assumptions (detailed in the paper's appendix on proofs), the correlation-based test statistic converges in distribution to a standard normal, $\mathcal{N}(0,1)$.
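As a rough illustration of how a convergence result of this kind supports a hypothesis test, the sketch below uses a textbook statistic: $\sqrt{T}$ times the sample Pearson correlation of two length-$T$ sequences, which is approximately standard normal when the sequences are independent with finite variance. This is an assumption chosen for illustration and is not necessarily the statistic defined in the paper; the names `correlation_z`, `two_sided_p_value`, and `simulate`, the two-sided alternative, and the simulated data are all hypothetical.

```python
import math
import random


def correlation_z(x, y):
    """Return sqrt(T) * sample Pearson correlation of x and y.

    Illustrative statistic only: approximately N(0, 1) when x and y are
    independent with finite variance. Not taken from the paper.
    """
    t = len(x)
    if t != len(y) or t < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / t
    mean_y = sum(y) / t
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("constant sequence: correlation is undefined")
    return math.sqrt(t) * cov / math.sqrt(var_x * var_y)


def two_sided_p_value(z):
    """P(|N(0,1)| >= |z|), computed with the complementary error function."""
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(dependence, t=2000, seed=0):
    """Simulate a hypothetical binary sequence x and a score sequence y.

    dependence = 0 makes x and y independent; larger values shift y
    whenever x is 1, which introduces a correlation between the two.
    """
    rng = random.Random(seed)
    x = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(t)]
    y = [rng.gauss(0.0, 1.0) + dependence * xi for xi in x]
    return correlation_z(x, y)


if __name__ == "__main__":
    for dep in (0.0, 0.2, 1.0):
        z = simulate(dep)
        print(f"dependence={dep}: z={z:.2f}, p={two_sided_p_value(z):.3g}")
```

With independent sequences the p-value is roughly uniform on $[0,1]$, so rejecting when it falls below $\alpha$ gives a false positive rate close to $\alpha$; any dependence pushes the statistic away from zero at a rate of about $\sqrt{T}$, which is why power grows with text length.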

Figures (15)

Figure 1: Overview of why spoofed text contains measurable artifacts. First, in (1), the spoofer generates a dataset $\mathcal{D}$ of $\xi$-watermarked texts from which they learn the watermark. As (2) illustrates, when later generating text, the spoofer is better at sampling a green token if (and only if) the context and the sampled token were in $\mathcal{D}$. This uncertainty introduces artifacts in the spoofed text. In contrast, the genuine watermarking algorithm is consistent with respect to the context and hence contains no such artifacts. Lastly, in (3), we build statistical tests for discovery of these artifacts, distinguishing between spoofed and $\xi$-watermarked texts even if their Z-scores $Z_\xi$ computed using the watermark detector are the same.
Figure 2: Histograms of $Z_S(\Omega)$ (top) and $Z_R(\Omega,\Omega')$ (bottom), with y-axes scaled to represent normalized density. The top row is computed using the unigram score and the Standard method, and the bottom row is computed using the $(h+1)$-gram score and the Reprompting method. A green line indicates that the $\mathcal{N}(0,1)$ hypothesis is not rejected (top p-value), an orange line indicates that a normality test is not rejected (bottom p-value), and a red line indicates that both are rejected at the 5% level.
Figure 3: Experimental rejection rate of $\xi$-watermarked text on Llama2 7B.
Figure 4: Experimental True Positive Rate of spoofed text. The dotted lines are the identity and serve as a reference for the expected rejection rate under the null. Since, in practice, a low false positive rate ($\alpha$) is desirable, the logarithmic scale on $\alpha$ highlights the true positive rate at low $\alpha$ values.
Figure 5: Evolution of $\mathbb{E}[Z_R(\Omega,\Omega')]$ with $T$ for different spoofer LMs.
...and 10 more figures
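
The identity reference line in the rejection-rate plots reflects a general property of a calibrated test: when the statistic is standard normal under the null, the fraction of texts rejected at level $\alpha$ should be close to $\alpha$. The sketch below checks this on simulated z-scores. It does not use the paper's data or code; `rejection_rate`, `simulate_z_scores`, the two-sided p-value, and the mean shift of 3.0 used for the alternative are hypothetical choices for illustration.

```python
import math
import random


def two_sided_p_value(z):
    """P(|N(0,1)| >= |z|)."""
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rate(z_scores, alpha):
    """Fraction of z-scores whose two-sided p-value is below alpha."""
    if not z_scores:
        raise ValueError("need at least one z-score")
    rejected = sum(1 for z in z_scores if two_sided_p_value(z) < alpha)
    return rejected / len(z_scores)


def simulate_z_scores(n, shift=0.0, seed=0):
    """Draw n z-scores from N(shift, 1); shift = 0 simulates the null."""
    rng = random.Random(seed)
    return [rng.gauss(shift, 1.0) for _ in range(n)]


if __name__ == "__main__":
    null_z = simulate_z_scores(20000)            # stand-in for genuine text
    alt_z = simulate_z_scores(20000, shift=3.0)  # stand-in for spoofed text
    for alpha in (0.01, 0.05, 0.10):
        print(
            f"alpha={alpha}: "
            f"null rejection rate={rejection_rate(null_z, alpha):.4f}, "
            f"alternative rejection rate={rejection_rate(alt_z, alpha):.4f}"
        )
```

Under the simulated null the rejection rate tracks $\alpha$, while under the shifted alternative it is the test's power at that level.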

Theorems & Definitions (4)

Lemma 4.1
Theorem 9.1: Lindeberg CLT
Theorem 9.2: Delta method
proof
