LLM Watermark Evasion via Bias Inversion
Jeongyeon Hwang, Sangdon Park, Jungseul Ok
TL;DR
The paper examines the robustness of LLM watermarking against adversarial paraphrase attacks. It derives a theoretical bound showing that lowering the average probability of green-token generation by $\delta$ bounds the detection probability as $\Pr[D] \le \exp(-\tfrac{1}{2}N\delta^2)$. It then presents BIRA, a black-box rewriting attack that applies a negative logit bias to a proxy green set derived from token self-information, suppressing the watermark signal during paraphrasing. Empirically, BIRA achieves state-of-the-art evasion across seven watermarking schemes and multiple models, reaching near-total evasion (≈99% attack success rate, ASR) while preserving semantic fidelity. The results highlight a systematic vulnerability in current watermarking approaches and underscore the need for stress testing and for defenses that withstand sophisticated paraphrase-based attacks.
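To get a feel for how quickly the bound $\exp(-\tfrac{1}{2}N\delta^2)$ shrinks, the following minimal sketch evaluates it for a few settings. It assumes that $N$ denotes the number of tokens scored by the detector, which this summary does not state explicitly; the function name `detection_bound` and the sample values are illustrative only.

```python
import math


def detection_bound(num_tokens: int, delta: float) -> float:
    """Evaluate the upper bound exp(-N * delta^2 / 2) on detection probability.

    num_tokens: assumed to be the number of scored tokens (N).
    delta: reduction in the average green-token probability.
    """
    if num_tokens < 0 or delta < 0:
        raise ValueError("num_tokens and delta must be non-negative")
    return math.exp(-0.5 * num_tokens * delta ** 2)


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    for n in (50, 200, 800):
        for d in (0.05, 0.1, 0.2):
            print(f"N={n:4d}  delta={d:.2f}  bound={detection_bound(n, d):.4g}")
```

Because the exponent scales with $N\delta^2$, even a modest $\delta$ makes the bound very small once the text is a few hundred tokens long.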
Abstract
Watermarking for large language models (LLMs) embeds a statistical signal during generation to enable detection of model-produced text. While watermarking has proven effective in benign settings, its robustness under adversarial evasion remains contested. To advance a rigorous understanding and evaluation of such vulnerabilities, we propose the Bias-Inversion Rewriting Attack (BIRA), which is theoretically motivated and model-agnostic. BIRA weakens the watermark signal by suppressing the logits of likely watermarked tokens during LLM-based rewriting, without any knowledge of the underlying watermarking scheme. Across recent watermarking methods, BIRA achieves over 99% evasion while preserving the semantic content of the original text. Beyond demonstrating an attack, our results reveal a systematic vulnerability, emphasizing the need for stress testing and robust defenses.
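The logit-suppression idea can be illustrated with a toy sketch in plain Python. This is not the authors' implementation: the selection rule (keep the top fraction of tokens by self-information, assuming higher self-information marks a token as more likely green), the parameters `top_fraction` and `bias`, and all function names are assumptions made for illustration. The per-token probabilities are assumed to come from some reference language model scoring the watermarked text.

```python
import math


def proxy_green_set(token_ids, token_probs, top_fraction=0.5):
    """Pick the highest self-information tokens as a stand-in green set.

    token_probs[i] is the (hypothetical) reference-model probability of
    token_ids[i] in the watermarked text; every probability must be > 0.
    """
    if len(token_ids) != len(token_probs):
        raise ValueError("token_ids and token_probs must have equal length")
    info = {}
    for tok, p in zip(token_ids, token_probs):
        # Self-information in nats; keep the largest value seen per token id.
        info[tok] = max(info.get(tok, 0.0), -math.log(p))
    ranked = sorted(info, key=info.get, reverse=True)
    keep = math.ceil(len(ranked) * top_fraction)
    return set(ranked[:keep])


def apply_negative_bias(logits, green_ids, bias=4.0):
    """Return a copy of `logits` with `bias` subtracted at proxy green ids."""
    return [x - bias if i in green_ids else x for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data: a five-token vocabulary and a short "watermarked" sequence.
token_ids = [3, 1, 4, 1, 2]
token_probs = [0.02, 0.60, 0.05, 0.50, 0.30]
green = proxy_green_set(token_ids, token_probs)

# Hypothetical next-token logits from the rewriting model.
logits = [0.0, 1.0, 0.5, 2.0, 1.5]
biased = apply_negative_bias(logits, green)

mass_before = sum(softmax(logits)[i] for i in green)
mass_after = sum(softmax(biased)[i] for i in green)
print(sorted(green), round(mass_before, 3), round(mass_after, 3))
```

In a real rewriting pipeline the bias would be applied at every decoding step of the paraphrasing model; the sketch shows a single step to keep the effect on the probability mass easy to inspect.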
