LLM Watermark Evasion via Bias Inversion

Jeongyeon Hwang, Sangdon Park, Jungseul Ok

TL;DR

The paper examines the robustness of LLM watermarking against adversarial paraphrase attacks. It derives a theoretical bound showing that lowering the average probability of green-token generation by $\delta$ drives the detection probability down to $\Pr[D] \le \exp(-\tfrac{1}{2}N\delta^2)$, and then presents BIRA, a black-box rewriting attack that applies a negative bias to a proxy green set derived from token self-information, suppressing the watermark signal during paraphrasing. Empirically, BIRA achieves state-of-the-art evasion across seven watermarking schemes and multiple models, including near-total evasion (≈99% attack success rate, ASR), while preserving semantic fidelity. The results highlight a systematic vulnerability in current watermarking approaches and underscore the need for stress testing and for defenses that are robust to sophisticated paraphrase-based attacks.
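To get a feel for how quickly the bound quoted in the TL;DR decays with text length, the following minimal sketch evaluates $\exp(-\tfrac{1}{2}N\delta^2)$ numerically. The function name `detection_bound` and the sample values of $N$ and $\delta$ are illustrative assumptions, not figures from the paper.

```python
import math


def detection_bound(num_tokens: int, delta: float) -> float:
    """Evaluate the upper bound exp(-N * delta^2 / 2) on detection probability.

    num_tokens plays the role of N and delta the role of the probability
    reduction in the TL;DR; both are supplied by the caller.
    """
    if num_tokens < 0 or not 0.0 <= delta <= 1.0:
        raise ValueError("need num_tokens >= 0 and 0 <= delta <= 1")
    return math.exp(-0.5 * num_tokens * delta ** 2)


if __name__ == "__main__":
    # Illustrative values only: a fixed delta and growing text length.
    for n in (50, 200, 800):
        print(n, round(detection_bound(n, 0.1), 4))
```

With $\delta = 0.1$, the bound falls from roughly 0.78 at 50 tokens to roughly 0.018 at 800 tokens, which illustrates why even a modest reduction in green-token probability becomes decisive for longer texts.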

Abstract

Watermarking for large language models (LLMs) embeds a statistical signal during generation to enable detection of model-produced text. While watermarking has proven effective in benign settings, its robustness under adversarial evasion remains contested. To advance a rigorous understanding and evaluation of such vulnerabilities, we propose the Bias-Inversion Rewriting Attack (BIRA), which is theoretically motivated and model-agnostic. BIRA weakens the watermark signal by suppressing the logits of likely watermarked tokens during LLM-based rewriting, without any knowledge of the underlying watermarking scheme. Across recent watermarking methods, BIRA achieves over 99% evasion while preserving the semantic content of the original text. Beyond demonstrating an attack, our results reveal a systematic vulnerability, emphasizing the need for stress testing and robust defenses.
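The logit suppression described in the abstract can be illustrated on a toy vocabulary. This is a minimal sketch under stated assumptions, not the authors' implementation: the helper names, the four-token vocabulary, the choice of proxy set, and the bias value are all hypothetical, and how the proxy set is actually built is not shown here. The symbol `beta` follows the negative bias $\beta < 0$ named in the Figure 1 caption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_negative_bias(logits, proxy_green_ids, beta):
    """Return a copy of logits with beta (< 0) added on the proxy green set."""
    if beta >= 0:
        raise ValueError("beta must be negative to suppress tokens")
    proxy = set(proxy_green_ids)
    return [x + beta if i in proxy else x for i, x in enumerate(logits)]


def proxy_mass(logits, proxy_green_ids):
    """Total sampling probability assigned to the proxy green set."""
    probs = softmax(logits)
    return sum(probs[i] for i in set(proxy_green_ids))


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, 0.0]  # toy 4-token vocabulary (assumption)
    proxy = [0, 2]                 # hypothetical proxy green set
    before = proxy_mass(logits, proxy)
    after = proxy_mass(apply_negative_bias(logits, proxy, beta=-2.0), proxy)
    print(f"proxy-set probability: {before:.3f} -> {after:.3f}")
```

Because the bias is applied only to tokens in the proxy set, the relative ordering of the remaining tokens is unchanged; the rewriting model simply becomes less likely to pick the suppressed tokens at each step.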

Paper Structure

This paper contains 28 sections, 2 theorems, 25 equations, 8 figures, 13 tables, 2 algorithms.

Key Result

Theorem 1

Let the detector be $\mathcal{D}(y,\mathcal{W}_k)=\mathbf{1}\{Z(y;\mathcal{W}_k)\ge \tau\}$ and suppose there exists a nondecreasing function $h:[0,1]\to\mathbb{R}$ with […], where $\mathcal{G}(\mathcal{W}_k)$ denotes the green set produced by watermarking $\mathcal{W}_k$. Then, for a given $N$, there exists $p_\tau\in[0,1]$ such that […], with $p_\tau=\inf\{p:\, h(p)\ge \tau\}$. In particular, for th…
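As a concrete instance of a threshold detector of the form $\mathbf{1}\{Z \ge \tau\}$, the sketch below uses the one-proportion $z$-statistic on the count of green tokens, as in KGW-style detection. That $Z$ takes this particular form is an assumption made for illustration; the function names, `green_fraction`, and the token counts are hypothetical. For this statistic, $z$ is a nondecreasing function of the observed green fraction, so thresholding $z$ at $\tau$ is the same as thresholding the green fraction at a corresponding value.

```python
import math


def z_score(green_count: int, num_tokens: int, green_fraction: float) -> float:
    """One-proportion z-statistic for the observed number of green tokens.

    green_fraction is the share of the vocabulary expected to be green
    when no watermark is present (an assumed detector parameter).
    """
    if num_tokens <= 0 or not 0.0 < green_fraction < 1.0:
        raise ValueError("need num_tokens > 0 and 0 < green_fraction < 1")
    expected = green_fraction * num_tokens
    std = math.sqrt(num_tokens * green_fraction * (1.0 - green_fraction))
    return (green_count - expected) / std


def detect(green_count: int, num_tokens: int, green_fraction: float, tau: float) -> int:
    """Indicator detector: 1 if the z-statistic reaches the threshold tau."""
    return int(z_score(green_count, num_tokens, green_fraction) >= tau)


def fraction_threshold(num_tokens: int, green_fraction: float, tau: float) -> float:
    """Smallest observed green fraction at which z reaches tau."""
    return green_fraction + tau * math.sqrt(
        green_fraction * (1.0 - green_fraction) / num_tokens
    )


if __name__ == "__main__":
    # Made-up counts for a 200-token text with half the vocabulary green.
    print(detect(143, 200, 0.5, tau=4.0))  # many green tokens
    print(detect(106, 200, 0.5, tau=4.0))  # near the unwatermarked rate
    print(round(fraction_threshold(200, 0.5, 4.0), 4))
```

Under these assumed settings, an attack only needs to push the observed green fraction below the value returned by `fraction_threshold` for the detector to output 0.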

Figures (8)

  • Figure 1: Illustration of BIRA. A watermarked LLM typically increases the likelihood of sampling green tokens by adding a positive bias $\gamma > 0$ to their logits at each generation step. In contrast, BIRA applies a negative bias $\beta < 0$ to a proxy set of green tokens (since the true set is unknown), thereby suppressing their sampling probability. This inversion lowers the probability of generating green tokens and weakens the watermark signal, enabling the paraphrased text to evade detection.
  • Figure 2: Comparison of detection performance across watermarking algorithms with an adjusted threshold, which mitigates the effect of the default threshold. We show the best F1 score ($\downarrow$) and TPR ($\downarrow$) at FPR of 1% and 10%. BIRA consistently achieves lower F1 and TPR than all baselines, indicating greater difficulty for detectors in distinguishing attacked text from human-written text. Exact values are provided in the appendix.
  • Figure 3: Comparison of text quality across different attacks for various watermarking methods, evaluated by LLM judgment score ($\uparrow$), Self-BLEU score ($\downarrow$), and Perplexity ($\downarrow$). Our method preserves semantic fidelity to the original text compared to other attack baselines (DIPPER and SIRA) while providing stronger paraphrasing, as reflected in lower Self-BLEU scores. Additional results for NLI score ($\uparrow$) and S-BERT score ($\uparrow$) are provided in an appendix figure, and exact values are detailed in the appendix.
  • Figure 3: $z$-score comparison of attacks on the SIR and Unigram watermarking schemes.
  • Figure 4: Qualitative comparison of KGW-watermarked text and the same passage after a BIRA attack with Llama-3.1-8B. The attack paraphrases to suppress green tokens while preserving meaning, lowering the $z$-score from 6.03 to 0.83 and evading detection at a threshold of 4. More examples with longer sentences and other watermarking schemes appear in the appendix.
  • ...and 3 more figures

Theorems & Definitions (4)

  • Theorem 1
  • Theorem 2
  • proof: Proof of Theorem [thm:equivalence]
  • proof: Proof of Theorem [thm:evasion]