Watermarking Language Models with Error Correcting Codes

Patrick Chao; Yan Sun; Edgar Dobriban; Hamed Hassani

TL;DR

This work provides an information-theoretic perspective on watermarking, a powerful statistical test for detection and for generating $p$-values, and theoretical guarantees. Empirical findings suggest the watermark is fast, powerful, and robust, comparing favorably to the state of the art.

Abstract

Recent progress in large language models enables the creation of realistic machine-generated content. Watermarking is a promising approach to distinguish machine-generated text from human text, embedding statistical signals in the output that are ideally undetectable to humans. We propose a watermarking framework that encodes such signals through an error correcting code. Our method, termed robust binary code (RBC) watermark, introduces no noticeable degradation in quality. We evaluate our watermark on base and instruction fine-tuned models and find that our watermark is robust to edits, deletions, and translations. We provide an information-theoretic perspective on watermarking, a powerful statistical test for detection and for generating $p$-values, and theoretical guarantees. Our empirical findings suggest our watermark is fast, powerful, and robust, comparing favorably to the state-of-the-art.
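
To make the idea of a detection test that outputs a $p$-value concrete, here is a minimal sketch. It assumes (this is an illustration, not necessarily the paper's exact statistic) that the detector counts how many of $n$ observed bits agree with the bits expected under the watermark key, and that under the null hypothesis of unwatermarked text each bit agrees independently with probability $1/2$. The function name `binomial_pvalue` is hypothetical.

```python
from math import comb


def binomial_pvalue(matches: int, n: int) -> float:
    """One-sided p-value P(Bin(n, 1/2) >= matches).

    Assumption: under the null, each of the n bits matches the expected
    bit independently with probability 1/2.
    """
    if not 0 <= matches <= n:
        raise ValueError("matches must lie between 0 and n")
    # Exact upper-tail count of outcomes with at least `matches` agreements.
    tail = sum(comb(n, k) for k in range(matches, n + 1))
    return tail / 2 ** n


if __name__ == "__main__":
    # More agreements than chance gives a smaller p-value.
    for m in (50, 60, 70, 80):
        print(m, binomial_pvalue(m, 100))
```

A text would be flagged as watermarked when this value falls below a chosen threshold $\alpha$; the exact sum avoids any normal approximation for short texts.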

Paper Structure (33 sections, 3 theorems, 12 equations, 9 figures, 8 tables, 8 algorithms)

Introduction
Background
Error Correcting Codes
Reduction to Binary Vocabulary
Sampling Correlated Bits
Robust Binary Code Watermark
Simple Watermark
Full Watermark
Theoretical Results
Experimental Results
Discussion
Further Related Work
Bias-based watermarking.
Distortion-free and cryptographic watermarks.
Robustness and attacks.
...and 18 more sections

Key Result

Lemma 4.2

For $q\in[0,1]$, we have the inequality For $q_1,\ldots, q_n \in [0,1]$, we have the inequality

Figures (9)

Figure 1: An overview of our method: at each generation step, the binary conversion of the previous tokens $\Gamma(X_{(i-w_{in}):(i-1)})$ is combined with a random bit string $R$ via an exclusive-or operation to construct the message $M$. The message is then encoded using an error-correcting code (ECC) to produce the codeword $Y$. The binarized language model generates binary strings through the Correlated Binary Sampling Channel [christ2024pseudorandom] with codeword $Y$. Finally, the binary output is mapped back to the vocabulary space using the binary decoding function.
Figure 2: Bounds on $H(q)$ with functions of $\vert 1/2-q\vert$, for $q\in[0,1]$.
Figure 3: Watermarking performance of the base Llama-3-8B model with RBC using LDPC and one-to-one codes, and baseline methods from [kirchenbauer2023watermark, huo2024token], averaged over $100$ generations for ten prompts. Left: The mean log-$p$-values with standard errors shaded. Middle: The detection probability with standard errors shaded. Right: The mean perplexity of the generated texts with standard errors shaded.
Figure 4: Watermarking performance of the base Llama-3-8B model with RBC using LDPC and one-to-one codes, and baseline methods. Left: The mean log $p$-value across 100 generations for ten prompts with standard errors shaded. Right: The detection probability with $\alpha=10^{-6}$ with standard errors shaded. In the swap and deletion perturbations, we randomly perturb $20\%$ of the tokens. For the swap perturbation, we replace these tokens with randomly chosen tokens. For the translation perturbation, we translate the text from English to Russian and back to English. For the paraphrase perturbation, we paraphrase the text using the same Llama-3-8B model.
Figure 5: Empirical false positive rate (FPR) vs. $p$-value thresholds.
...and 4 more figures
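
The pipeline described in the Figure 1 caption can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: `to_bits` is a hypothetical fixed-width stand-in for the binary conversion $\Gamma$, a repetition code replaces the LDPC code, `next_bit_prob` is a hypothetical stand-in for the binarized language model, and `cbsc_sample` uses one construction (assumed here) that yields an output bit with the model's marginal probability $q$ while staying correlated with the codeword bit.

```python
import random


def to_bits(tokens, bits_per_token):
    """Hypothetical binary conversion: fixed-width big-endian bits per token id."""
    return [(t >> s) & 1 for t in tokens for s in range(bits_per_token - 1, -1, -1)]


def xor_bits(a, b):
    """Bitwise exclusive-or of two equal-length bit lists."""
    return [x ^ y for x, y in zip(a, b)]


def repetition_encode(message, r):
    """Stand-in ECC (assumption): repeat each message bit r times."""
    return [bit for bit in message for _ in range(r)]


def repetition_decode(bits, r):
    """Majority vote over each block of r bits."""
    return [int(2 * sum(bits[i:i + r]) > r) for i in range(0, len(bits), r)]


def cbsc_sample(y, q, rng):
    """Sample B with P(B = 1) = q when y is a fair coin, correlated with y.

    Assumed construction: U = (1 - y + U') / 2 with U' uniform on [0, 1),
    so U is uniform on [0, 1) marginally, and B = 1 exactly when U <= q.
    """
    u = (1 - y + rng.random()) / 2
    return int(u <= q)


def embed(prev_tokens, key_bits, next_bit_prob, r, bits_per_token, rng):
    """Build M = Gamma(prev tokens) XOR R, encode it, and sample output bits."""
    message = xor_bits(to_bits(prev_tokens, bits_per_token), key_bits)
    codeword = repetition_encode(message, r)
    out = []
    for y in codeword:
        q = next_bit_prob(out)  # model probability that the next bit is 1
        out.append(cbsc_sample(y, q, rng))
    return message, out


if __name__ == "__main__":
    rng = random.Random(0)
    key = [1, 0, 1, 1, 0, 0]  # random bit string R (fixed here for the demo)
    msg, bits = embed([5, 2], key, lambda prefix: 0.7, 5, 3, rng)
    print("message:", msg)
    print("decoded:", repetition_decode(bits, 5))
```

Under this construction the output bit disagrees with the codeword bit with probability $\vert 1/2-q\vert$, so high-entropy steps ($q$ near $1/2$) carry the codeword almost perfectly and the ECC absorbs the mismatches from low-entropy steps.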

Theorems & Definitions (8)

Definition 2.1
Definition 4.1: Entropy
Lemma 4.2
Theorem 4.3: Bounding the proportion of mismatches in a CBSC with the entropy
Theorem 4.4: Exact block decoding
proof
proof
proof
