Watermark Smoothing Attacks against Language Models
Hongyan Chang, Hamed Hassani, Reza Shokri
TL;DR
The paper analyzes watermarking for language models and shows that watermark detectability and text quality are both constrained by the model’s confidence. It introduces the Smoothing Attack, which uses confidence estimates to decide which tokens to replace and which to retain, erasing watermark traces while preserving or enhancing text quality. Across ten watermarking schemes and multiple open-source models, the attack removes watermarks substantially and often outperforms paraphrasing-based attacks, demonstrating the need for more robust watermarking schemes. The findings have practical implications for AI-safety safeguards and prompt a reevaluation of watermark design when adversaries have realistic access to confidence information.
Abstract
Watermarking is a key technique for detecting AI-generated text. In this work, we study its vulnerabilities and introduce the Smoothing Attack, a novel watermark removal method. By leveraging the relationship between the model's confidence and watermark detectability, our attack selectively smooths the watermarked content, erasing watermark traces while preserving text quality. We validate our attack on open-source models ranging from $1.3$B to $30$B parameters on $10$ different watermarks, demonstrating its effectiveness. Our findings expose critical weaknesses in existing watermarking schemes and highlight the need for stronger defenses.
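The relationship between model confidence and watermark detectability can be illustrated with a toy calculation. The sketch below is not the Smoothing Attack itself. It assumes a green-list watermark that adds a bias `delta` to the logits of a favored token subset, a scheme the text above does not specify, and all names (`watermark_shift`, `green`, `delta`) and the four-token vocabulary are hypothetical. It measures how far the watermark moves the next-token distribution when the model is confident versus uncertain.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def watermark_shift(logits, green, delta=2.0):
    """Total variation distance between the original next-token
    distribution and the one obtained after adding `delta` to the
    logits of the (assumed) green-list tokens."""
    p = softmax(logits)
    q = softmax([x + delta if i in green else x for i, x in enumerate(logits)])
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

# Hypothetical green list; the top token (index 0) is not in it.
green = {1, 3}

# Confident model: almost all probability mass on token 0.
confident = [10.0, 0.0, 0.0, 0.0]
# Uncertain model: uniform over the four tokens.
uncertain = [0.0, 0.0, 0.0, 0.0]

shift_confident = watermark_shift(confident, green)
shift_uncertain = watermark_shift(uncertain, green)

print(f"shift when confident: {shift_confident:.4f}")
print(f"shift when uncertain: {shift_uncertain:.4f}")
```

Under these assumptions the bias barely changes a confident prediction but substantially reshapes an uncertain one, so most of the detectable signal sits at low-confidence positions. This is the kind of asymmetry a confidence-guided attack can exploit by targeting only those positions.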
