Optimizing Adaptive Attacks against Watermarks for Language Models
Abdulrahman Diaa, Toluwani Aremu, Nils Lukas
TL;DR
This work demonstrates that content watermarks for Large Language Models are vulnerable to adaptive, offline attacks that are feasible with small open-weight paraphrasers. By formulating robustness as an optimization objective and using preference-based data collection, the authors train paraphrasers to evade detection across multiple watermarking schemes while preserving text quality, achieving evasion rates exceeding 96% at low cost. They show strong transfer to unseen watermarks and Pareto-optimal performance across adaptive and non-adaptive settings, challenging prior robustness claims. The results motivate incorporating adaptive threat models into watermark design and defense strategies, and the authors publicly release their adaptive paraphrasers to accelerate further research.
Abstract
Large Language Models (LLMs) can be misused to spread unwanted content at scale. Content watermarking deters misuse by hiding messages in content, enabling its detection using a secret watermarking key. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but robustness is tested only against non-adaptive attackers who lack knowledge of the watermarking method and can find only suboptimal attacks. We formulate watermark robustness as an objective function and use preference-based optimization to tune adaptive attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks evade detection against all surveyed watermarks, (ii) training against any watermark succeeds in evading unseen watermarks, and (iii) optimization-based attacks are cost-effective. Our findings underscore the need to test robustness against adaptively tuned attacks. We release our adaptively optimized paraphrasers at https://github.com/nilslukas/ada-wm-evasion.
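To make the idea of preference-based data collection concrete, the following minimal sketch shows one way paraphrase candidates could be turned into (chosen, rejected) pairs for a preference optimizer. It is an illustration under stated assumptions, not the authors' implementation. The names `Candidate`, `score_candidates`, `build_preference_pairs`, `toy_detect`, `toy_quality`, and the `min_quality` threshold are all hypothetical. The toy detector and quality function are stand-ins for a surrogate watermark detector and a text-quality judge.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Candidate:
    text: str
    detected: bool  # True if the (surrogate) detector flags this paraphrase
    quality: float  # quality score in [0, 1]; higher is better


def score_candidates(
    paraphrases: List[str],
    detect: Callable[[str], bool],
    quality: Callable[[str], float],
) -> List[Candidate]:
    """Attach a detection verdict and a quality score to each paraphrase."""
    return [Candidate(p, detect(p), quality(p)) for p in paraphrases]


def build_preference_pairs(
    candidates: List[Candidate], min_quality: float = 0.5
) -> List[Tuple[str, str]]:
    """Pair every successful paraphrase (chosen) with every failed one (rejected).

    A paraphrase counts as successful only if it evades detection AND keeps
    quality at or above min_quality (a hypothetical threshold).
    """
    chosen = [c for c in candidates if not c.detected and c.quality >= min_quality]
    rejected = [c for c in candidates if c.detected or c.quality < min_quality]
    return [(c.text, r.text) for c in chosen for r in rejected]


# Toy stand-ins (assumptions for illustration only).
def toy_detect(text: str) -> bool:
    # Pretend the watermark shows up as repeated use of the token "green".
    return text.split().count("green") >= 2


def toy_quality(text: str) -> float:
    # Pretend longer paraphrases preserve more of the original content.
    return min(1.0, len(text.split()) / 5)


if __name__ == "__main__":
    paraphrases = [
        "green green sky",
        "the sky looks clear and bright today",
        "sky",
    ]
    for chosen, rejected in build_preference_pairs(
        score_candidates(paraphrases, toy_detect, toy_quality)
    ):
        print(f"chosen={chosen!r}  rejected={rejected!r}")
```

Pairs of this shape could then be fed to a preference-optimization method to tune a paraphraser. The sketch leaves the choice of optimizer, detector, and quality judge open.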
