Bypassing LLM Watermarks with Color-Aware Substitutions

Qilong Wu, Varun Chandrasekaran

TL;DR

This paper analyzes the robustness of logit-perturbation text watermarks that bias a set of green tokens and detect watermarking via a z-score on green-token usage. It introduces Self Color Testing-based Substitution (SCTS), a color-aware attack that extracts token-color information by prompting watermarked LLMs to generate controlled strings and then replaces green tokens with red ones within a constrained edit budget. The authors provide a theoretical treatment showing watermark strength scales as $\mathbb{E}[z] \propto \sqrt{T_e}$ and that, under favorable conditions ($q>\gamma$), detection becomes exponentially unlikely as text length grows; empirically, SCTS significantly lowers AUROC and raises attack success across two models and two watermark schemes, often achieving AUROC < 0.5 at budgets around $0.25$–$0.35$. The work highlights a substantial vulnerability in current watermarking approaches for long texts and argues for developing more robust defenses, while also acknowledging ethical considerations and the dual-use risks of watermark evasion research.
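
The detection statistic mentioned above can be sketched as follows. This is a minimal illustration, assuming the standard one-proportion z-test used with green-list watermarks; the function names `z_score` and `expected_z` and the parameter `gamma` (the green-list fraction) are illustrative and not taken from the paper's code.

```python
import math


def z_score(green_count: int, effective_length: int, gamma: float) -> float:
    """One-proportion z-test on the number of green tokens.

    Assumes each scored token is green with probability `gamma`
    when the text is NOT watermarked (the null hypothesis).
    """
    expected = gamma * effective_length
    std = math.sqrt(gamma * (1.0 - gamma) * effective_length)
    return (green_count - expected) / std


def expected_z(q: float, effective_length: int, gamma: float) -> float:
    """Expected z-score when each scored token is green with average probability q."""
    return (q - gamma) * math.sqrt(effective_length) / math.sqrt(gamma * (1.0 - gamma))


if __name__ == "__main__":
    # Quadrupling the effective length doubles the expected z-score.
    for t_e in (100, 400, 1600):
        print(t_e, round(expected_z(q=0.5, effective_length=t_e, gamma=0.25), 3))
```

Under these assumptions the expected z-score grows with the square root of the effective length, and its sign follows the sign of $q-\gamma$.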

Abstract

Watermarking approaches have been proposed to identify whether circulated text is human-written or generated by a large language model (LLM). The state-of-the-art watermarking strategy of Kirchenbauer et al. (2023a) biases the LLM to generate specific ("green") tokens. However, determining the robustness of this watermarking method is an open problem. Existing attack methods fail to evade detection for longer text segments. We overcome this limitation and propose Self Color Testing-based Substitution (SCTS), the first "color-aware" attack. SCTS obtains color information by strategically prompting the watermarked LLM and comparing output token frequencies. It uses this information to determine token colors, and substitutes green tokens with non-green ones. In our experiments, SCTS successfully evades watermark detection using fewer edits than related work. Additionally, we show both theoretically and empirically that SCTS can remove the watermark from arbitrarily long watermarked text.
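
The substitution step described above can be sketched in a few lines. This is a simplified illustration, not the paper's algorithm: the color test is abstracted into a hypothetical callable `is_green` (in SCTS this information comes from prompting the watermarked LLM and comparing output frequencies), `propose` is a hypothetical candidate generator, and the budget is interpreted here as the maximum fraction of tokens that may be edited.

```python
from typing import Callable, List, Sequence

ColorTest = Callable[[Sequence[str], str], bool]
Proposer = Callable[[Sequence[str], str], List[str]]


def count_green(tokens: Sequence[str], is_green: ColorTest, c: int = 1) -> int:
    """Count tokens judged green given their c preceding tokens."""
    return sum(is_green(tokens[i - c:i], tokens[i]) for i in range(c, len(tokens)))


def color_aware_substitute(
    tokens: Sequence[str],
    is_green: ColorTest,   # hypothetical stand-in for the self color test
    propose: Proposer,     # hypothetical source of replacement candidates
    budget: float,         # assumed: max fraction of tokens that may change
    c: int = 1,            # context width: c previous tokens decide the color
) -> List[str]:
    """Greedily replace green tokens with non-green candidates, left to right."""
    out = list(tokens)
    max_edits = int(budget * len(out))
    edits = 0
    for i in range(c, len(out)):
        if edits >= max_edits:
            break
        prefix = out[i - c:i]  # uses already-edited context
        if not is_green(prefix, out[i]):
            continue
        for candidate in propose(prefix, out[i]):
            if not is_green(prefix, candidate):
                out[i] = candidate
                edits += 1
                break
    return out
```

Because a token's color depends on the preceding tokens, the sketch always evaluates colors against the already-edited prefix; a replacement can therefore change the color of the token that follows it.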

Paper Structure (42 sections, 1 theorem, 34 equations, 7 figures, 16 tables, 2 algorithms)

Key Result

Theorem 1

$\mathbb{E}[z]$ is proportional to $\sqrt{T_e}$. To elaborate, first assume the colors of $c+1$-grams are independent. Then, we have: Furthermore, if the events that different $c+1$-grams are green are i.i.d., then:
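
A sketch of why the $\sqrt{T_e}$ scaling holds, under assumptions that may differ from the paper's exact statement: the detector is taken to use the one-proportion z-score of Kirchenbauer et al. (2023a), $|s|_G$ denotes the number of green $c+1$-grams among the $T_e$ scored ones, $\gamma$ is the green-list fraction, and $q$ is the average green probability (Definition 5). The variance line additionally assumes the colors are i.i.d. Bernoulli($q$).

```latex
\begin{align*}
z &= \frac{|s|_G - \gamma T_e}{\sqrt{\gamma(1-\gamma)\,T_e}},
  \qquad \mathbb{E}\big[|s|_G\big] = q\,T_e \\
\mathbb{E}[z] &= \frac{(q-\gamma)\,T_e}{\sqrt{\gamma(1-\gamma)\,T_e}}
  = \frac{q-\gamma}{\sqrt{\gamma(1-\gamma)}}\,\sqrt{T_e} \\
\operatorname{Var}[z] &= \frac{T_e\,q(1-q)}{\gamma(1-\gamma)\,T_e}
  = \frac{q(1-q)}{\gamma(1-\gamma)} \qquad \text{(i.i.d. case)}
\end{align*}
```

Under these assumptions the mean of $z$ moves away from zero at rate $\sqrt{T_e}$ while its variance stays constant, so the sign of $q-\gamma$ decides the direction in which $z$ drifts as the text grows.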

Figures (7)

  • Figure 1: Illustration of the setting for SCTS. The red box indicates the attacker's capability.
  • Figure 2: Simplified illustration of a single substitution in SCTS. The attack takes different actions depending on the frequencies observed in the self color test (SCT).
  • Figure 3: AUROC for vicuna-7b-v1.5-16k, 50 samples, UMD watermarking, $c=4$. The orange curve (SCTS) is consistently and significantly above other baselines, and it is the only one to cross $0.5$.
  • Figure 4: Confusion matrix and accuracy for SCT over 1000 samples for vicuna-7b-v1.5-16k, UMD. Accuracies are at least $0.5$ for all $c$ and hashing.
  • Figure 5: ASR for vicuna-7b-v1.5-16k, 50 samples, UMD, $c=4$, $z_{th}=4$. Under the same budget, SCTS (orange) evades detection significantly more often than other baselines.
  • ...and 2 more figures

Theorems & Definitions (11)

  • Definition 1: $c+1$-gram
  • Definition 2: Effective length $T_e$
  • Definition 3: Detection threshold $z_{th}$
  • Definition 4: Critical length $T_c$
  • Definition 5: Average green probability $q$
  • Theorem 1
  • Definition 6
  • Definition 7: New $2$-gram ratio $r_n$
  • Definition 8: Number of LLM calls $N_{T_e}$
  • Definition 9: $N_{new}$
  • ...and 1 more