Optimizing Adaptive Attacks against Watermarks for Language Models

Abdulrahman Diaa; Toluwani Aremu; Nils Lukas

TL;DR

This work demonstrates that content watermarks for Large Language Models are vulnerable to adaptive, offline attacks that are feasible with small open-weight paraphrasers. By formulating robustness as an optimization objective and using preference-based data collection, the authors train paraphrasers to evade detection across multiple watermarking schemes while preserving text quality, achieving evasion rates exceeding 96% at low cost. They show strong transfer to unseen watermarks and Pareto-optimal performance across adaptive and non-adaptive settings, challenging prior robustness claims. The results motivate incorporating adaptive threat models into watermark design and defense strategies, and the authors publicly release their adaptive paraphrasers to accelerate further research.

Abstract

Large Language Models (LLMs) can be misused to spread unwanted content at scale. Content watermarking deters misuse by hiding messages in content, enabling its detection using a secret watermarking key. Robustness is a core security property, stating that evading detection requires (significant) degradation of the content's quality. Many LLM watermarking methods have been proposed, but robustness is tested only against non-adaptive attackers who lack knowledge of the watermarking method and can find only suboptimal attacks. We formulate watermark robustness as an objective function and use preference-based optimization to tune adaptive attacks against the specific watermarking method. Our evaluation shows that (i) adaptive attacks evade detection against all surveyed watermarks, (ii) training against any watermark succeeds in evading unseen watermarks, and (iii) optimization-based attacks are cost-effective. Our findings underscore the need to test robustness against adaptively tuned attacks. We release our adaptively optimized paraphrasers at https://github.com/nilslukas/ada-wm-evasion.
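The preference-based data collection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' released code: the function `collect_preference_pairs`, its parameters, and the three callables (`paraphrase`, `p_value`, `quality`) are hypothetical names introduced here. It assumes the attacker scores candidates with a detector run under a surrogate key, since the provider's secret key is unknown, and that a text counts as detected when its p-value falls below a threshold `rho`.

```python
from typing import Callable, List, Tuple


def collect_preference_pairs(
    texts: List[str],
    paraphrase: Callable[[str], str],      # hypothetical: samples one paraphrase
    p_value: Callable[[str], float],       # hypothetical: surrogate-key detector
    quality: Callable[[str, str], float],  # hypothetical: quality score in [0, 1]
    n_samples: int = 8,
    rho: float = 0.01,
    min_quality: float = 0.5,
) -> List[Tuple[str, str, str]]:
    """Return (prompt, chosen, rejected) triples for preference tuning."""
    pairs = []
    for text in texts:
        # Sample several paraphrases of the same watermarked text.
        candidates = [paraphrase(text) for _ in range(n_samples)]
        # A candidate is "good" if it evades the surrogate detector
        # (p-value at or above rho) while keeping acceptable quality.
        good = [
            c for c in candidates
            if p_value(c) >= rho and quality(text, c) >= min_quality
        ]
        bad = [c for c in candidates if c not in good]
        # A pair is only usable when both a positive and a negative exist.
        if good and bad:
            chosen = max(good, key=p_value)    # least detectable good sample
            rejected = min(bad, key=p_value)   # most detectable bad sample
            pairs.append((text, chosen, rejected))
    return pairs
```

The resulting triples follow the prompt/chosen/rejected layout that preference-optimization trainers commonly expect; which optimizer and thresholds the authors actually use is not specified in this summary.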

Paper Structure (33 sections, 5 equations, 16 figures, 6 tables, 1 algorithm)

Introduction
Background
Threat Model
Related Work
Conceptual Approach
Robustness as an Objective Function
Preference Dataset Curation
Evaluation
Preference Dataset Collection
Ablation Studies
Experimental Results
Discussion
Conclusion
Appendix
Quality Metrics
...and 18 more sections

Figures (16)

Figure 1: Adaptive attackers know the watermarking algorithms (KeyGen, Verify), but not the secret key, so they can optimize a paraphraser against a specific watermark.
Figure 2: Algorithm 1 (preference dataset collection) paraphrases each text $N$ times in lines 13-17. The plot shows the expected evasion rate of the best sample (lines 15-17) as a function of the number of paraphrases, using a vanilla Llama2-7b as the paraphraser.
Figure 3: The evasion rates (Left) and text quality measured with LLM-Judge (Right). The attacker uses a matching Llama2-7b surrogate and paraphraser model versus the provider's Llama2-13b. Results for adaptive attacks are on the diagonal. For example, we obtain the bottom left value by training on Dist-Shift and testing on Inverse.
Figure 4: Adaptive attacks are Pareto-optimal. We show the evasion rate versus text quality trade-off against the Exp watermark (Aaronson, 2023), corresponding to the paper's definition of $(\epsilon,\delta)$-robustness. The provider uses a Llama3.1-70b model, whereas our attacker's models are up to $46\times$ smaller. Non-adaptive attacks are marked by circles, adaptive attacks by squares. The notation "Ours-Qwen-3b-Exp" means that we evaluate our attack using a Qwen2.5-3b model that was adaptively optimized against the Exp watermark.
Figure 5: (Left) The cumulative density of p-values on the Dist-Shift watermark (green), a vanilla Llama2-7b paraphraser (blue) and our adaptive Llama2-7b paraphraser (red). (Right) The median p-value relative to the text token length with a threshold of $\rho=0.01$ (dashed line).
...and 11 more figures
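The best-of-$N$ curve described in the Figure 2 caption can be approximated with a back-of-the-envelope calculation. The sketch below assumes, purely for illustration, that each paraphrase evades detection independently with the same probability `p_single`; this independence assumption and the function name are not from the paper, whose curve is measured empirically.

```python
def best_of_n_evasion_rate(p_single: float, n: int) -> float:
    """Probability that at least one of n paraphrases evades detection.

    Assumes independent, identically distributed evasion events
    (an illustrative simplification, not the paper's measurement).
    """
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must lie in [0, 1]")
    if n < 0:
        raise ValueError("n must be non-negative")
    # All n samples fail with probability (1 - p)^n; the complement is success.
    return 1.0 - (1.0 - p_single) ** n


if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16):
        print(n, round(best_of_n_evasion_rate(0.2, n), 3))
```

Under this simplification the rate rises quickly with $N$, which is why sampling several paraphrases per text makes it likely that the collection step finds at least one evading sample.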
