Enhancing LLM Watermark Resilience Against Both Scrubbing and Spoofing Attacks
Huanming Shen, Baizhou Huang, Xiaojun Wan
TL;DR
The paper addresses the vulnerability of LLM watermarking to scrubbing and spoofing attacks, which stems from a trade-off governed by watermark window size: smaller windows resist scrubbing better but are easier to reverse-engineer and spoof. It introduces SEEK, a watermarking scheme based on Sub-vocabulary decomposed Equivalent tExture Keys, in which multiple tokens within a watermark window can independently support detection; this redundancy improves scrubbing robustness without weakening spoofing resistance. The approach yields a Pareto improvement over prior KGW-based methods, with substantial gains in both spoofing and scrubbing robustness across multiple dataset settings while preserving generation quality. The work advances practical watermarking for provenance and misuse prevention in LLM outputs.
Abstract
Watermarking is a promising defense against the misuse of large language models (LLMs), yet it remains vulnerable to scrubbing and spoofing attacks. This vulnerability stems from an inherent trade-off governed by watermark window size: smaller windows resist scrubbing better but are easier to reverse-engineer, enabling low-cost statistics-based spoofing attacks. This work breaks the trade-off by introducing a novel mechanism, equivalent texture keys, in which multiple tokens within a watermark window can independently support detection. Building on this redundancy, we propose a novel watermark scheme with Sub-vocabulary decomposed Equivalent tExture Key (SEEK). It achieves a Pareto improvement, increasing resilience against scrubbing attacks without compromising robustness to spoofing. Experiments demonstrate SEEK's superiority over prior methods, yielding spoofing robustness gains of +88.2%/+92.3%/+82.0% and scrubbing robustness gains of +10.2%/+6.4%/+24.6% across diverse dataset settings.
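The window-size trade-off described above can be illustrated with a toy sketch. It assumes a simplified KGW-style "hard" green-list watermark in which the green list for each token is seeded by hashing all preceding tokens in the window; it is not SEEK itself, and every name here (`VOCAB_SIZE`, `GAMMA`, `is_green`, `generate`, `z_score`, `affected_positions`) is hypothetical and chosen only for illustration.

```python
import hashlib
import math
import random

VOCAB_SIZE = 1000  # hypothetical toy vocabulary size
GAMMA = 0.5        # assumed fraction of the vocabulary that is "green"


def is_green(context, token, gamma=GAMMA):
    """Pseudo-randomly mark `token` green, seeded by the preceding window."""
    payload = ",".join(map(str, context)) + "|" + str(token)
    digest = hashlib.sha256(payload.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def generate(length, window, seed=0):
    """Toy 'hard' watermark: only emit tokens that are green for their window."""
    rng = random.Random(seed)
    tokens = [rng.randrange(VOCAB_SIZE) for _ in range(window)]  # unscored prefix
    while len(tokens) < length:
        candidate = rng.randrange(VOCAB_SIZE)
        if is_green(tuple(tokens[-window:]), candidate):
            tokens.append(candidate)
    return tokens


def z_score(tokens, window, gamma=GAMMA):
    """One-proportion z-test on the number of green tokens among scored positions."""
    scored = len(tokens) - window
    greens = sum(
        is_green(tuple(tokens[i - window:i]), tokens[i], gamma)
        for i in range(window, len(tokens))
    )
    return (greens - gamma * scored) / math.sqrt(gamma * (1 - gamma) * scored)


def affected_positions(edit_index, window, length):
    """Scored positions whose green/red status can change if one token is replaced.

    The replaced token is scored itself and also appears in the window of the
    next `window` tokens, so a larger window exposes more positions per edit.
    """
    first = max(edit_index, window)
    last = min(edit_index + window, length - 1)
    return list(range(first, last + 1))


def context_space(window):
    """Number of distinct windows an attacker would need statistics for."""
    return VOCAB_SIZE ** window
```

In this sketch, text from `generate(204, 4)` has 200 scored positions that are all green, so `z_score` returns sqrt(200), about 14.1. Replacing one interior token can disturb `window + 1` scored positions, which is why larger windows are easier to scrub, while `context_space(window)` grows exponentially with the window, which is why smaller windows are easier to reverse-engineer from token statistics.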
