LLM Watermark Evasion via Bias Inversion

Jeongyeon Hwang; Sangdon Park; Jungseul Ok

TL;DR

The paper examines the robustness of LLM watermarking against adversarial paraphrase attacks. It derives a theoretical bound showing that lowering the average probability of green-token generation by $\delta$ yields the detection bound $\Pr[D] \le \exp(-\tfrac{1}{2}N\delta^2)$. It then presents BIRA, a black-box rewriting attack that applies a negative bias to a proxy green set, derived from token self-information, to suppress watermark signals during paraphrasing. Empirically, BIRA achieves state-of-the-art evasion across seven watermarking schemes and multiple models, including near-total evasion (≈99% attack success rate, ASR), while preserving semantic fidelity. The results highlight a systematic vulnerability in current watermarking approaches and underscore the need for stress testing and for defenses that are robust to sophisticated paraphrase-based attacks.
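To give a feel for how quickly the quoted bound shrinks, the sketch below evaluates $\exp(-\tfrac{1}{2}N\delta^2)$ for a few settings. It takes the formula exactly as stated in the TL;DR and does not reproduce the conditions under which the paper proves it. Reading $N$ as the number of scored tokens is an assumption, and the function name `detection_bound` is hypothetical.

```python
import math


def detection_bound(num_tokens: int, delta: float) -> float:
    """Evaluate exp(-N * delta^2 / 2), the upper bound quoted in the TL;DR.

    num_tokens: N, assumed here to be the number of tokens scored by the detector.
    delta: reduction in the average green-token probability.
    """
    if num_tokens < 0 or delta < 0:
        raise ValueError("num_tokens and delta must be non-negative")
    return math.exp(-0.5 * num_tokens * delta ** 2)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for n in (50, 200, 500):
        for d in (0.1, 0.2):
            print(f"N={n:4d} delta={d:.1f} bound={detection_bound(n, d):.3e}")
```

Because the exponent scales with $N\delta^2$, the bound tightens as the text gets longer or the reduction $\delta$ gets larger.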

Abstract

Watermarking for large language models (LLMs) embeds a statistical signal during generation to enable detection of model-produced text. While watermarking has proven effective in benign settings, its robustness under adversarial evasion remains contested. To advance a rigorous understanding and evaluation of such vulnerabilities, we propose the Bias-Inversion Rewriting Attack (BIRA), which is theoretically motivated and model-agnostic. BIRA weakens the watermark signal by suppressing the logits of likely watermarked tokens during LLM-based rewriting, without any knowledge of the underlying watermarking scheme. Across recent watermarking methods, BIRA achieves over 99% evasion while preserving the semantic content of the original text. Beyond demonstrating an attack, our results reveal a systematic vulnerability, emphasizing the need for stress testing and robust defenses.
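The logit-suppression idea described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation. The summary only says that the proxy green set is derived from token self-information, so treating the highest self-information tokens as likely green, the `top_fraction` cutoff, the bias magnitude, and all function names are hypothetical choices.

```python
import math
from typing import List, Sequence, Set


def self_information(prob: float) -> float:
    """Self-information -log p of a token with probability p (in nats)."""
    return -math.log(prob)


def proxy_green_set(token_ids: Sequence[int],
                    token_probs: Sequence[float],
                    top_fraction: float = 0.5) -> Set[int]:
    """Assumed rule: keep the token ids with the highest self-information."""
    if len(token_ids) != len(token_probs):
        raise ValueError("token_ids and token_probs must have equal length")
    ranked = sorted(zip(token_ids, token_probs),
                    key=lambda pair: self_information(pair[1]),
                    reverse=True)
    keep = math.ceil(top_fraction * len(ranked))
    return {tok for tok, _ in ranked[:keep]}


def apply_negative_bias(logits: List[float],
                        suppressed: Set[int],
                        bias: float = 4.0) -> List[float]:
    """Subtract a fixed bias from the logits of tokens in the proxy green set."""
    return [x - bias if i in suppressed else x for i, x in enumerate(logits)]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    # Toy vocabulary of four tokens with made-up probabilities under a reference model.
    green = proxy_green_set([0, 1, 2, 3], [0.9, 0.05, 0.5, 0.01])
    biased = apply_negative_bias([1.0, 1.0, 1.0, 1.0], green, bias=2.0)
    print(sorted(green), softmax(biased))
```

In an actual rewriting attack the bias would be applied at every decoding step of the paraphrasing model. The toy example only shows that probability mass moves away from the suppressed tokens.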
