Watermark under Fire: A Robustness Evaluation of LLM Watermarking

Jiacheng Liang, Zian Wang, Lauren Hong, Shouling Ji, Ting Wang

TL;DR

This work addresses the robustness evaluation of LLM watermarking by introducing WaterPark, an open-source platform that unifies 10 watermarkers, 12 watermark-removal attacks, and 8 evaluation metrics. It conducts observational and controlled analyses across multiple LLMs and task domains, revealing how design choices, such as context dependency and generation strategy, drive the trade-off between robustness and fidelity. The findings show that context-free, distribution-transform watermarks often offer stronger attack resilience at the cost of fidelity, while text-dependent, soft-perturbation methods preserve quality but are more attack-prone. Experiments that combine detectors and that use surrogate detectors for attacks further illuminate practical defense and attack dynamics. The work provides deployment guidelines and a shared benchmark to advance robust watermarking research in real-world adversarial settings.

Abstract

Various watermarking methods ("watermarkers") have been proposed to identify LLM-generated texts; yet, due to the lack of unified evaluation platforms, many critical questions remain under-explored: i) What are the strengths/limitations of various watermarkers, especially their attack robustness? ii) How do various design choices impact their robustness? iii) How to optimally operate watermarkers in adversarial environments? To fill this gap, we systematize existing LLM watermarkers and watermark removal attacks, mapping out their design spaces. We then develop WaterPark, a unified platform that integrates 10 state-of-the-art watermarkers and 12 representative attacks. More importantly, by leveraging WaterPark, we conduct a comprehensive assessment of existing watermarkers, unveiling the impact of various design choices on their attack robustness. We further explore the best practices to operate watermarkers in adversarial environments. We believe our study sheds light on current LLM watermarking techniques while WaterPark serves as a valuable testbed to facilitate future research.

Paper Structure

This paper contains 47 sections, 9 equations, 14 figures, 10 tables.

Figures (14)

  • Figure 1: Illustration of LLM watermarking and watermark removal attacks.
  • Figure 2: Quality preservation of different attacks.
  • Figure 3: Watermarker robustness to multi-attacks. a) Context dependency: TGRL (text-dependent) and UG (context-free); b) Generation strategy: TGRL (distribution-shift) and GO (distribution-transform); c) Detection method: UPV (Model-based) and UPV$_{stat}$ (Score-based). RDF and GO.
  • Figure 4: Detection of watermarked texts by watermarker-specific and generic detectors ('1' and '0' indicate that the detector classifies the given watermarked text as watermarked or non-watermarked, respectively).
  • Figure 5: Attacks leveraging surrogate detectors.
  • ...and 9 more figures