Let Watermarks Speak: A Robust and Unforgeable Watermark for Language Models

Minhao Bai

TL;DR

This work tackles the challenge of watermarking language-model outputs with both robustness and unforgeability. It presents a novel single-bit watermarking scheme that can embed two signals using Wat-Sampler and Dual Inverse Transform Sampling (DITS), preserving the model's output distribution. It then extends to a multi-bit scheme through a hash-linked watermark chain, achieving prefix-unforgeability under collision-resistance assumptions and providing formal guarantees for correctness, undetectability, and robustness, complemented by empirical validation on popular language models. The approach offers a practical, scalable path toward verifiable watermarking with integrity checks suitable for real-world deployment.

Abstract

Watermarking is an effective way to trace model-generated content. Current watermarking methods cannot resist forgery attacks, such as a deceptive claim that model-generated content is a response to a fabricated prompt, and none of them can be made unforgeable without degrading robustness. Unforgeability demands that the watermarked output be not only detectable but also verifiable for integrity, indicating whether it has been modified. This underscores the need for a multi-bit watermarking scheme. Recent works try to build multi-bit schemes on top of existing zero-bit watermarking schemes, but they either degrade robustness or impose a significant computational burden. We design a novel single-bit watermarking scheme that can embed two different watermark signals. Our main contribution is that we are the first to propose an undetectable, robust, single-bit watermarking scheme, with robustness comparable to that of the most advanced zero-bit watermarking schemes. We then construct a multi-bit watermarking scheme that uses the hash value of the prompt or of the most recently generated content as the watermark signal and embeds it into the subsequent content, which guarantees unforgeability. Additionally, we provide sufficient experiments on several popular language models, which other advanced methods with provable guarantees often lack. The results show that our method is practically effective and robust.
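
The hash-linking idea in the abstract can be illustrated with a small sketch. It is not the paper's construction: it only shows how signals derived from the hash of the prompt and of each preceding block tie the content together, so that a fabricated prompt or an altered block no longer matches. The function names, the 32-bit signal length, and the use of SHA-256 are assumptions made for illustration, and the step of embedding signals into text and extracting them again is simulated by passing the signals directly.

```python
import hashlib


def signal_bits(data: str, n_bits: int = 32) -> list[int]:
    """Derive the first n_bits (at most 256) of SHA-256(data) as a bit list."""
    digest = hashlib.sha256(data.encode("utf-8")).digest()
    return [(digest[i // 8] >> (7 - i % 8)) & 1 for i in range(n_bits)]


def chain_signals(prompt: str, blocks: list[str], n_bits: int = 32) -> list[list[int]]:
    """Signal for block i is the hash of what precedes it: the prompt, then block i-1."""
    signals = []
    previous = prompt
    for block in blocks:
        signals.append(signal_bits(previous, n_bits))
        previous = block
    return signals


def verify_chain(prompt: str, blocks: list[str],
                 extracted: list[list[int]], n_bits: int = 32) -> bool:
    """Recompute the chain and compare it with the signals extracted from the text."""
    return chain_signals(prompt, blocks, n_bits) == extracted


# Hypothetical example data; "extracted" stands in for the output of a detector.
blocks = ["first generated block", "second generated block"]
extracted = chain_signals("real prompt", blocks)

genuine = verify_chain("real prompt", blocks, extracted)
forged_prompt = verify_chain("fabricated prompt", blocks, extracted)
tampered = verify_chain("real prompt", ["edited block", blocks[1]], extracted)
```

In this sketch `genuine` is true, while `forged_prompt` and `tampered` are false unless the truncated hashes happen to collide. The last block is not covered, because nothing after it carries its hash.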

Paper Structure (19 sections, 1 theorem, 27 equations, 3 figures, 7 algorithms)

This paper contains 19 sections, 1 theorem, 27 equations, 3 figures, 7 algorithms.

Key Result

Lemma 1

$X_1, X_2, \dots, X_n$ are independent and identically distributed random variables, each taking values in $[0,1]$. Let $X = \frac{1}{n}\sum_{i=1}^n X_i$ and $\mu = \mathbb{E}[X]$. For any $t > 0$, the probability that the sample mean $X$ deviates from its expectation $\mu$ by at least $t$ satisfies $\Pr[X - \mu \ge t] \le e^{-2nt^2}$ and $\Pr[|X - \mu| \ge t] \le 2e^{-2nt^2}$.
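
Lemma 1 is Hoeffding's inequality, whose standard one-sided form for variables in $[0,1]$ is $\Pr[X - \mu \ge t] \le e^{-2nt^2}$. The following sketch checks that bound numerically. The choice of uniform variables, the values of `n` and `t`, and the function names are assumptions for illustration only.

```python
import math
import random


def hoeffding_bound(n: int, t: float) -> float:
    """One-sided Hoeffding bound exp(-2 n t^2) for variables in [0, 1]."""
    return math.exp(-2 * n * t * t)


def empirical_tail(n: int, t: float, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which the mean of n Uniform[0,1) draws exceeds 0.5 by at least t."""
    rng = random.Random(seed)
    mu = 0.5  # expectation of a Uniform[0,1) variable
    hits = 0
    for _ in range(trials):
        mean = sum(rng.random() for _ in range(n)) / n
        if mean - mu >= t:
            hits += 1
    return hits / trials


n, t = 100, 0.1
bound = hoeffding_bound(n, t)                 # exp(-2), roughly 0.135
observed = empirical_tail(n, t, trials=5000)  # stays below the bound
```

The bound is loose for uniform variables, so the observed frequency is typically far below it. This looseness is the price of a guarantee that holds for every distribution on $[0,1]$.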

Figures (3)

  • Figure 1: An illustration of Wat-Sampler and DITS.
  • Figure 2: An illustration of Detect-1bit.
  • Figure 3: Games used in the proof of computational indistinguishability.

Theorems & Definitions (20)

  • Definition 1: Single-bit Watermarking Scheme
  • Claim 1: DITS Maintains the Distribution of Predictor
  • Lemma 1: Hoeffding's Inequality
  • Definition 2: Correctness
  • Claim 2
  • Definition 3: Undetectability
  • Claim 3
  • Definition 4: Hamming Ball
  • Definition 5: $(\gamma,\epsilon)$-Robustness of Single Watermark Output, against Substitution
  • Definition 6: $(\gamma,\epsilon)$-Robustness of a Watermark Scheme, against Substitution
  • ...and 10 more
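
Claim 1 above states that DITS maintains the predictor's distribution. The sketch below is not DITS. It shows plain inverse transform sampling driven by a keyed pseudorandom number, a simpler technique with the same distribution-preserving property: when the driving number is uniform on $[0,1)$, each token is selected with exactly its model probability. The key, the toy distribution, and the function names are hypothetical.

```python
from __future__ import annotations

import hashlib
import hmac
from collections import Counter


def keyed_uniform(key: bytes, context: str) -> float:
    """Map (key, context) to a pseudorandom number in [0, 1) via HMAC-SHA256."""
    digest = hmac.new(key, context.encode("utf-8"), hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def inverse_transform_sample(probs: dict[str, float], u: float) -> str:
    """Return the first token whose cumulative probability exceeds u."""
    cumulative = 0.0
    token = ""
    for token, p in probs.items():
        cumulative += p
        if u < cumulative:
            return token
    return token  # fallback for floating-point round-off near 1.0


# Toy next-token distribution and a hypothetical secret key.
probs = {"the": 0.5, "a": 0.3, "an": 0.2}
key = b"hypothetical-secret-key"
trials = 20000

counts = Counter(
    inverse_transform_sample(probs, keyed_uniform(key, f"context-{i}"))
    for i in range(trials)
)
frequencies = {token: counts[token] / trials for token in probs}
```

Across many contexts the entries of `frequencies` come out close to the entries of `probs`, even though each individual choice is fully determined by the key and the context. That determinism is what a key holder can later test for.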