Can Watermarked LLMs be Identified by Users via Crafted Prompts?

Aiwei Liu, Sheng Guan, Yiming Liu, Leyi Pan, Yifei Zhang, Liancheng Fang, Lijie Wen, Philip S. Yu, Xuming Hu

TL;DR

This work designs an identification algorithm called Water-Probe that detects watermarks through well-designed prompts to the LLM, and introduces the Water-Bag strategy, which significantly improves watermark imperceptibility by merging multiple watermark keys.

Abstract

Text watermarking for Large Language Models (LLMs) has made significant progress in detecting LLM outputs and preventing misuse. Current watermarking techniques offer high detectability, minimal impact on text quality, and robustness to text editing. However, current research lacks investigation into the imperceptibility of watermarking techniques in LLM services. This is crucial because LLM providers may not want to disclose the presence of watermarks in real-world scenarios: disclosure could reduce user willingness to use the service and make watermarks more vulnerable to attacks. This work is the first to investigate the imperceptibility of watermarked LLMs. We design an identification algorithm called Water-Probe that detects watermarks through well-designed prompts to the LLM. Our key motivation is that current watermarked LLMs expose consistent biases under the same watermark key, resulting in similar differences across prompts under different watermark keys. Experiments show that almost all mainstream watermarking algorithms are easily identified with our well-designed prompts, while Water-Probe demonstrates a minimal false positive rate for non-watermarked LLMs. Finally, we propose that the key to enhancing the imperceptibility of watermarked LLMs is to increase the randomness of watermark key selection. Based on this, we introduce the Water-Bag strategy, which significantly improves watermark imperceptibility by merging multiple watermark keys.

Paper Structure (36 sections, 1 theorem, 32 equations, 5 figures, 11 tables, 1 algorithm)

Key Result

Theorem 1

Let $x_1$ and $x_2$ be two different prompts satisfying the similarity condition in Equation \ref{eq:correlated-prompts}. Let $k_1$ and $k_2$ be two randomly sampled watermark keys from the key space $\mathcal{K}$. The effect of applying these keys on the output distribution should be highly consistent across […] where $P_M^F$ is the watermarked distribution, and $\text{Sim}(\cdot, \cdot)$ is a similarity measure […]

Figures (5)

  • Figure 1: Illustration of our Water-Probe algorithm for identifying watermarked LLMs. We first construct two prompts with similar output distributions, then sample repeatedly using two fixed watermark keys for each prompt. The presence of a watermark is determined by comparing the similarity of distribution differences between the two prompts. Details in \ref{sec:method}.
  • Figure 2: Distribution of start keys for identical prefixes in Exp-Edit watermarking. Analysis based on prompts described in Section \ref{sec:repeated-sampling} for Water-Probe-v2. Each subplot represents a specific prefix (shown in title).
  • Figure 3: The left plot shows the variation of z-scores detected by Water-Probe-v1 and Water-Probe-v2 as a function of sampling temperature. The right plot illustrates the change in z-scores detected by Water-Probe-v1 and Water-Probe-v2 as the number of samples varies.
  • Figure 4: Distribution of start keys for identical prefixes in Water-Bag strategy. Analysis based on prompts described in Section \ref{sec:repeated-sampling} for Water-Probe-v2. Each subplot represents a specific prefix (shown in title).
  • Figure 5: The variation of z-scores at different temperatures when calculating similarity without using rank transformation in Equation \ref{eq:average-similarity}.
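
The procedure summarized in the Figure 1 caption can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's implementation: the vocabulary is synthetic, the watermark is a hypothetical green-list logit bias chosen as a stand-in, the keys `11` and `22` are arbitrary, and similarity is plain cosine similarity rather than the rank-transformed similarity and z-score test the paper refers to. All function names are invented for illustration. The sketch also sets the watermark key directly, which a real user cannot do; how crafted prompts achieve the same effect is not modeled here.

```python
import numpy as np

def green_bias(key: int, vocab_size: int, delta: float = 2.0) -> np.ndarray:
    """Hypothetical watermark rule: the key selects half the vocabulary to boost by delta."""
    rng = np.random.default_rng(key)
    bias = np.zeros(vocab_size)
    green = rng.permutation(vocab_size)[: vocab_size // 2]
    bias[green] = delta
    return bias

def next_token_probs(logits: np.ndarray, key: int, watermarked: bool) -> np.ndarray:
    """Softmax over logits, with the key-dependent bias added only if watermarked."""
    z = logits + (green_bias(key, logits.size) if watermarked else 0.0)
    z = z - z.max()  # numerical stability
    p = np.exp(z)
    return p / p.sum()

def empirical_dist(probs: np.ndarray, n_samples: int, rng: np.random.Generator) -> np.ndarray:
    """Estimate a distribution by repeated sampling, as a black-box user would."""
    return rng.multinomial(n_samples, probs) / n_samples

def probe_similarity(watermarked: bool, vocab_size: int = 200,
                     n_samples: int = 20000, seed: int = 0) -> float:
    rng = np.random.default_rng(seed)
    # Two toy "prompts" whose output distributions are similar but not identical.
    logits_1 = 0.5 * rng.normal(size=vocab_size)
    logits_2 = logits_1 + 0.05 * rng.normal(size=vocab_size)
    k1, k2 = 11, 22  # two fixed (arbitrary) watermark keys
    diffs = []
    for logits in (logits_1, logits_2):
        d_k1 = empirical_dist(next_token_probs(logits, k1, watermarked), n_samples, rng)
        d_k2 = empirical_dist(next_token_probs(logits, k2, watermarked), n_samples, rng)
        diffs.append(d_k1 - d_k2)  # distribution difference between the two keys
    a, b = diffs
    # Cosine similarity of the two per-prompt difference vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

if __name__ == "__main__":
    print("watermarked:    ", probe_similarity(watermarked=True))
    print("non-watermarked:", probe_similarity(watermarked=False))
```

In this toy setting the watermarked similarity should come out close to 1, because both prompts inherit the same key-dependent bias, while the non-watermarked similarity should stay near 0, because the differences are only sampling noise.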

Theorems & Definitions (9)

  • Definition 1: Large Language Model
  • Definition 2: Watermark Rule
  • Definition 3: N-Gram Based Watermarking
  • Definition 4: Fixed-Key-List Based Watermarking
  • Definition 5: Black-box Watermark Identification
  • Definition 6: Distortion-Free Watermark
  • Theorem 1: Consistency of Watermark Effect
  • Definition 7: Water-Bag Strategy
  • proof