Enhancing LLM Watermark Resilience Against Both Scrubbing and Spoofing Attacks

Huanming Shen, Baizhou Huang, Xiaojun Wan

TL;DR

The paper tackles the vulnerability of LLM watermarking to scrubbing and spoofing by addressing the trade-off imposed by watermark window size. It introduces SEEK, a watermarking scheme based on Sub-vocabulary decomposed Equivalent Texture Keys, which creates redundancy in detection tokens to improve scrubbing robustness without weakening spoofing resistance. The approach yields a Pareto improvement over prior KGW-based methods, delivering substantial gains in spoofing and scrubbing robustness across multiple datasets while preserving generation quality. This work advances practical watermarking for provenance and misuse prevention in LLM outputs, with broad implications for secure and trusted AI deployments.

Abstract

Watermarking is a promising defense against the misuse of large language models (LLMs), yet it remains vulnerable to scrubbing and spoofing attacks. This vulnerability stems from an inherent trade-off governed by watermark window size: smaller windows resist scrubbing better but are easier to reverse-engineer, enabling low-cost statistics-based spoofing attacks. This work breaks this trade-off by introducing a novel mechanism, equivalent texture keys, where multiple tokens within a watermark window can independently support detection. Building on this redundancy, we propose a novel watermark scheme with Sub-vocabulary decomposed Equivalent tExture Key (SEEK). It achieves a Pareto improvement, increasing resilience against scrubbing attacks without compromising robustness to spoofing. Experiments demonstrate SEEK's superiority over prior methods, yielding spoofing robustness gains of +88.2%/+92.3%/+82.0% and scrubbing robustness gains of +10.2%/+6.4%/+24.6% across diverse dataset settings.

Paper Structure

This paper contains 38 sections, 7 theorems, 50 equations, 9 figures, 13 tables, and 3 algorithms.

Key Result

Proposition 4.1

Given a hash function whose output space has cardinality $d$ and a watermark window size $h$, the probability $p(h, d)$ that a hash collision occurs is approximated from below by $p(h, d) \geq 1 - e^{-\frac{h(h-1)}{2d}}$.
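
As a rough numerical check of this bound, the sketch below compares it with the exact birthday-problem collision probability. It assumes the $h$ tokens in a window hash independently and uniformly into $d$ values, which is an idealization rather than something this summary states. The function names and the sample $(h, d)$ pairs are illustrative only and do not come from the paper.

```python
import math


def exact_collision_probability(h: int, d: int) -> float:
    """Probability that at least two of h hashes coincide.

    Assumes independent, uniform hashing into d values (an idealization).
    """
    if h > d:
        return 1.0  # pigeonhole: a collision is certain
    p_no_collision = 1.0
    for i in range(h):
        p_no_collision *= (d - i) / d
    return 1.0 - p_no_collision


def collision_lower_bound(h: int, d: int) -> float:
    """Lower bound 1 - exp(-h(h-1) / (2d)) from Proposition 4.1."""
    return 1.0 - math.exp(-h * (h - 1) / (2 * d))


if __name__ == "__main__":
    # Illustrative (h, d) pairs; not settings reported in the paper.
    for h, d in [(2, 4), (4, 4), (4, 16), (8, 16), (8, 64)]:
        exact = exact_collision_probability(h, d)
        bound = collision_lower_bound(h, d)
        print(f"h={h:2d} d={d:3d} exact={exact:.4f} bound={bound:.4f}")
```

The bound follows from $1 - x \leq e^{-x}$ applied to each factor of the no-collision probability, so under the uniform-hashing assumption the exact value is never below it. For a fixed $d$, both quantities grow with $h$.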

Figures (9)

  • Figure 1: Performance of different schemes under scrubbing and spoofing attacks. Varying the watermark window size induces a trade-off between scrubbing and spoofing robustness. Scrubbing robustness is evaluated using DIPPER on the C4-RealNewsLike dataset. Spoofing robustness is evaluated using statistics-based attacks on the Dolly-CW dataset. Our scheme achieves improved robustness on both axes, reaching Pareto optimality.
  • Figure 2: Performance of different watermark window schemes under spoofing attack. The attack is conducted using 500 malicious texts generated by Dolly-CW, targeting a calibrated detector under FPR of 0.1%.
  • Figure 3: An analysis of generation quality across different schemes on a subset of C4 by Log Diversity metric, relative to the unwatermarked text.
  • Figure 3: Performance comparison of various SEEK parameter schemes under the DIPPER-$\mathrm{I}$ attack on a C4-Eval subset. $h$ and $d$ denote the watermark window size and the cardinality of the hash space, respectively.
  • Figure 4: (A) KGW-MIN with equivalent texture keys proposed in Section \ref{sec:motivation}. Each value in hash space derives a distinct $\theta^{i}$ to generate a partition of vocabulary. We then select one as the final green list. (B) SEEK proposed in Section \ref{sec:seek}. Different from (A), each value in the hash space only contributes to a partition of a sub-vocabulary $G^{i}$. We then merge all partitions as the final green list.
  • ...and 4 more figures

Theorems & Definitions (7)

  • Proposition 4.1
  • Proposition 4.2
  • Proposition C.1
  • Proposition C.2
  • Proposition D.1
  • Proposition D.2
  • Proposition D.3