Watermark Smoothing Attacks against Language Models

Hongyan Chang, Hamed Hassani, Reza Shokri

TL;DR

The paper analyzes watermarking for language models and shows that both watermark detectability and the watermark's impact on text quality are tied to the model’s confidence. It introduces the Smoothing Attack, which uses confidence estimates to decide which tokens to replace and which to retain, erasing watermark traces while preserving text quality. Across ten watermarking schemes and multiple open-source models, the attack removes a substantial share of the watermark signal and often outperforms paraphrasing-based attacks, which points to the need for more robust watermarking schemes. The findings have practical implications for AI-safety safeguards and call for reevaluating watermark design when an adversary has realistic access to confidence information.

Abstract

Watermarking is a key technique for detecting AI-generated text. In this work, we study its vulnerabilities and introduce the Smoothing Attack, a novel watermark removal method. By leveraging the relationship between the model's confidence and watermark detectability, our attack selectively smooths the watermarked content, erasing watermark traces while preserving text quality. We validate our attack on open-source models ranging from $1.3$B to $30$B parameters on $10$ different watermarks, demonstrating its effectiveness. Our findings expose critical weaknesses in existing watermarking schemes and highlight the need for stronger defenses.

Paper Structure

This paper contains 63 sections, 48 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: The correlations among $S_t$ (watermark contribution score), $\mathbb{E}_{v \sim P_t}\left[\mathbf{1}\{v \in \mathcal{G}_t\}\right]$ (probability that a token sampled from the un-watermarked model is green), and $\|P_t\|^2$ (model confidence), evaluated on OPT-1.3B with the Green-red list watermark with $\gamma=0.5$ and $\delta=1.0$. The values are computed from different prefixes, constructed from the text of the Wikipedia article about Harry Potter.
  • Figure 2: The correlation between $S_t$ (watermark contribution score) and $\|P_t\|^2$ (model confidence), evaluated on OPT-1.3B with the Gumbel and Tournament sampling (with $m$ tournaments) watermarks, using the same setup as in Figure 1. Each sample corresponds to a specific prefix and secret key. $\|P_t\|^2$ is computed from the original un-watermarked model. The overall observation is similar to that for the Green-red list watermark: $S_t$ decreases as $\|P_t\|^2$ increases.
  • Figure 3: The correlation between $D_{TV}(P_t,\widetilde{P}_t)$, i.e., the negative impact on text quality due to watermarks (in blue), and $\|P_t\|^2$, measured on OPT-1.3B with the Green-red list, Gumbel sampling, and Tournament sampling watermarks. We also plot $D_{TV}(P_t,P^{\text{ref}}_t)$, which measures the negative impact on text quality if we instead sample from the reference model OPT-125M (in red).
  • Figure 4: The correlation between $D_{TV}(P_t,\widetilde{P}_t)$, i.e., the negative impact of watermarking on the text quality, and $S_t$, i.e., the token-level contribution to watermark detectability. We measure the correlation on the OPT-1.3B model. For all three watermarking schemes, $D_{TV}(P_t,\widetilde{P}_t)$ increases as $S_t$ increases.
  • Figure 5: Each subfigure shows how the true positive rate (TPR) varies with perplexity (PPL) for a specific attack. No attack (a) corresponds to watermarked text without modifications, paraphrasing (b) uses GPT-3.5-turbo to rewrite the text, and smoothing (c) randomly replaces some tokens to remove the watermark. Colors indicate the particular watermarking method and each point corresponds to one of three models (OPT-1.3B, Llama3-8B, Qwen2-1.5B).
  • ...and 3 more figures
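
The quantities named in the captions above can be made concrete on a toy next-token distribution. The following is a minimal sketch, not the paper's code: it assumes the common logit-boost form of the Green-red list watermark (add $\delta$ to the logits of green tokens, then renormalize), draws the green list at random instead of from a keyed hash of the prefix, and uses synthetic logits in place of a real model. All function and variable names are hypothetical. $S_t$ is not defined in this excerpt, so it is not computed here.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D logit vector."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def confidence(p):
    """||P_t||^2: equals 1/V for a uniform distribution and 1 for a one-hot one."""
    return float(np.sum(p ** 2))

def green_mass(p, green_mask):
    """E_{v ~ P_t}[1{v in G_t}]: probability that a sampled token is green."""
    return float(p[green_mask].sum())

def green_red_watermark(logits, green_mask, delta):
    """Assumed watermark form: boost green-token logits by delta, renormalize."""
    return softmax(logits + delta * green_mask)

def total_variation(p, q):
    """D_TV(P, Q) = 0.5 * sum_v |P(v) - Q(v)|."""
    return 0.5 * float(np.abs(p - q).sum())

# Toy setup (assumption): random green list covering a gamma fraction of the vocabulary.
rng = np.random.default_rng(0)
vocab_size, gamma, delta = 1000, 0.5, 1.0
green_mask = np.zeros(vocab_size, dtype=bool)
green_mask[rng.permutation(vocab_size)[: int(gamma * vocab_size)]] = True

# Two synthetic next-token distributions: low confidence and high confidence.
flat_logits = np.zeros(vocab_size)
peaked_logits = np.zeros(vocab_size)
peaked_logits[0] = 12.0

results = {}
for name, logits in [("flat", flat_logits), ("peaked", peaked_logits)]:
    p = softmax(logits)                                      # un-watermarked P_t
    p_wm = green_red_watermark(logits, green_mask, delta)    # watermarked distribution
    results[name] = {
        "confidence": confidence(p),
        "green_mass": green_mass(p, green_mask),
        "tv": total_variation(p, p_wm),
    }
    print(name, results[name])
```

In this toy setting the watermark shifts the flat (low-confidence) distribution far more than the peaked (high-confidence) one, which matches the direction of the trends stated in the captions for Figures 2 and 4. It illustrates the definitions only and says nothing about the magnitudes reported in the paper.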