On Google's SynthID-Text LLM Watermarking System: Theoretical Analysis and Empirical Validation

Romina Omidi; Yun Dong; Binghui Wang

TL;DR

This paper presents the first theoretical analysis of SynthID-Text, focusing on its detection performance and watermark robustness and complemented by empirical validation. It proves that the mean score is inherently vulnerable to increasing the number of tournament layers, and it designs a layer inflation attack that breaks SynthID-Text.

Abstract

Google's SynthID-Text, the first production-ready generative watermarking system for large language models (LLMs), introduces a novel Tournament-based method that achieves state-of-the-art detectability for identifying AI-generated text. The system's innovations are: 1) a new Tournament sampling algorithm for watermark embedding, 2) a detection strategy based on a score function (e.g., the Bayesian or mean score), and 3) a unified design that supports both distortionary and non-distortionary watermarking methods. This paper presents the first theoretical analysis of SynthID-Text, with a focus on its detection performance and watermark robustness, complemented by empirical validation. For example, we prove that the mean score is inherently vulnerable to increasing the number of tournament layers, and we design a layer inflation attack that breaks SynthID-Text. We also prove that the Bayesian score offers improved watermark robustness with respect to the number of layers, and we further establish that the optimal Bernoulli distribution for watermark detection is achieved when its parameter is set to 0.5. Together, these theoretical and empirical insights not only deepen our understanding of SynthID-Text but also open new avenues for analyzing effective watermark removal strategies and designing robust watermarking techniques. Source code is available at https://github.com/romidi80/Synth-ID-Empirical-Analysis.
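To make the embedding-and-detection pipeline concrete, the following is a minimal, simplified sketch of Tournament sampling with two-candidate matches, followed by mean-score detection. It is not Google's implementation. The keyed-hash `g_value` function, the 4-token context window `WINDOW`, the toy uniform 50-token distribution, and the values of `KEY`, `M` (layers) and `N` (length) are all hypothetical choices made for illustration, with g-values assumed to be Bernoulli(0.5).

```python
import hashlib
import random

WINDOW = 4  # hypothetical: number of previous tokens used as watermarking context


def g_value(key, context, token, layer):
    """Hypothetical Bernoulli(0.5) g-value derived from a keyed SHA-256 hash."""
    data = f"{key}|{context}|{token}|{layer}".encode()
    return hashlib.sha256(data).digest()[0] & 1


def tournament_sample(weights, key, context, m, rng):
    """Select one token via an m-layer tournament of two-candidate matches."""
    candidates = rng.choices(range(len(weights)), weights=weights, k=2 ** m)
    for layer in range(m):
        winners = []
        for a, b in zip(candidates[0::2], candidates[1::2]):
            ga, gb = g_value(key, context, a, layer), g_value(key, context, b, layer)
            if ga == gb:
                winners.append(rng.choice((a, b)))  # tie: pick uniformly at random
            else:
                winners.append(a if ga > gb else b)  # higher g-value wins the match
        candidates = winners
    return candidates[0]


def generate(n, weights, key, m, rng):
    """Generate n watermarked tokens from a fixed toy next-token distribution."""
    out = []
    for _ in range(n):
        out.append(tournament_sample(weights, key, tuple(out[-WINDOW:]), m, rng))
    return out


def mean_score(tokens, key, m):
    """Average g-value over every token and every layer (equal layer weights)."""
    vals = []
    for t, tok in enumerate(tokens):
        context = tuple(tokens[max(0, t - WINDOW):t])  # same context as in generate()
        vals.extend(g_value(key, context, tok, layer) for layer in range(m))
    return sum(vals) / len(vals)


rng = random.Random(0)
weights = [1.0] * 50          # toy next-token distribution: uniform over 50 tokens
KEY, M, N = 42, 3, 200        # hypothetical key, tournament layers, text length
watermarked = generate(N, weights, KEY, M, rng)
plain = rng.choices(range(50), weights=weights, k=N)
wm_score = mean_score(watermarked, KEY, M)
plain_score = mean_score(plain, KEY, M)
print(f"watermarked: {wm_score:.3f}  unwatermarked: {plain_score:.3f}")
```

Watermarked text should score clearly above 0.5, while unwatermarked text should stay near 0.5. Note that this mean score weights all layers equally.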

Paper Structure (37 sections, 23 theorems, 143 equations, 3 figures, 1 table)

Introduction
Background
SynthID-Text for Generative LLM Watermarking
$g$-Value Function/Distribution, Score Function, and Detection Metric
Preliminaries
Theoretical Analysis
Mean Score
Bayesian Score
Empirical Evaluations
Empirical Validation for Our TPR Trend
Empirical Validity of the CLT Assumption
Layer Inflation Attack
Discussions, Limitations and Future Work
Related Work
Conclusion
...and 22 more sections

Key Result

Theorem 1

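Let $C_{l,t}$ be the collision probability w.r.t. layer $l$ at the $t$-th token, and let $F_g$ be the CDF of the unwatermarked $g$-value distribution $f_g$. The CDF $F_{gw}$ of the watermarked $g$-value distribution $f_{gw}$ is given by:

$$F_{gw}(y) = C_{l,t}\,F_g(y) + (1 - C_{l,t})\,F_g(y)^2,$$

where, if $g$ is continuous, the PDF $f_{gw}$ is given by:

$$f_{gw}(y) = f_g(y)\left[C_{l,t} + 2(1 - C_{l,t})\,F_g(y)\right],$$

and, if $g$ is discrete, the PMF $f_{gw}$ is given by:

$$f_{gw}(y) = f_g(y)\left[C_{l,t} + (1 - C_{l,t})\left(2F_g(y) - f_g(y)\right)\right].$$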
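A quick way to sanity-check the single-layer result is by simulation. The sketch below makes two assumptions: each distinct token receives an independent Bernoulli(0.5) g-value, and ties are broken uniformly at random. Under these assumptions, a single-layer, two-candidate tournament yields $F_{gw}(y) = C\,F_g(y) + (1 - C)\,F_g(y)^2$, where $C = \sum_x p(x)^2$. This means $P(g_w = 1) = 0.75 - 0.25\,C$. The helper names and the toy distribution `p` are hypothetical, and the check is a Monte Carlo illustration rather than a proof.

```python
import random


def collision_prob(p):
    """Collision probability C = sum_x p(x)^2 for two i.i.d. draws from p."""
    return sum(q * q for q in p)


def simulate_pmf_one(p, trials, rng):
    """Monte Carlo estimate of P(g_w = 1) for one layer with two candidates."""
    tokens = range(len(p))
    hits = 0
    for _ in range(trials):
        # Assumption: each distinct token gets a fresh i.i.d. Bernoulli(0.5) g-value.
        g = [int(rng.random() < 0.5) for _ in tokens]
        a, b = rng.choices(tokens, weights=p, k=2)
        if g[a] == g[b]:
            winner = rng.choice((a, b))  # tie, including a collision a == b
        else:
            winner = a if g[a] > g[b] else b  # higher g-value wins
        hits += g[winner]
    return hits / trials


def predicted_pmf_one(p):
    """P(g_w = 1) = 1 - F_gw(0), with F_gw(y) = C F_g(y) + (1 - C) F_g(y)^2, F_g(0) = 0.5."""
    C = collision_prob(p)
    F0 = 0.5
    return 1 - (C * F0 + (1 - C) * F0 ** 2)


rng = random.Random(1)
p = [0.4, 0.3, 0.2, 0.1]  # toy next-token distribution; here C = 0.30
empirical = simulate_pmf_one(p, 100_000, rng)
print(f"C = {collision_prob(p):.2f}, simulated = {empirical:.4f}, "
      f"formula = {predicted_pmf_one(p):.4f}")
```

A more peaked distribution raises $C$ and pulls $P(g_w = 1)$ back toward the unwatermarked value 0.5.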

Figures (3)

Figure 1: Left: Overview of SynthID-Text's non-distortionary Tournament-based watermarking with $m$ layers; Right: Layer inflation attack, which appends $N$ additional layers to SynthID-Text's Tournament sampling to remove the watermark.
Figure 2: (a)-(c) show the TPR trend on GPT-2B, Gemma-7B, and Mistral-7B, respectively.
Figure 3: Gaussian distribution fitting of mean scores on the three models.

Theorems & Definitions (26)

Definition 1: True Positive Rate (TPR) (van2004detection)
Definition 2: Collision probabilities (dathathri2024scalable)
Theorem 1: Watermarked $g$-value distribution for single-layer tournament (dathathri2024scalable)
Theorem 2: Bayesian likelihoods for $m$-layer Tournament sampling (dathathri2024scalable)
Proposition 1: $\text{TPR}@\text{FPR}=\epsilon$ for normally distributed MS
Theorem 3
Theorem 4
Theorem 5
Theorem 6
Theorem 7
...and 16 more
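Proposition 1's exact statement is not reproduced in this excerpt. As a generic sketch, assume the mean score is Gaussian under both hypotheses, $N(\mu_0, \sigma_0^2)$ for unwatermarked text and $N(\mu_1, \sigma_1^2)$ for watermarked text, in the spirit of the CLT assumption the outline lists. The threshold at $\text{FPR}=\epsilon$ is then $\tau = \mu_0 + \sigma_0\,\Phi^{-1}(1-\epsilon)$, and $\text{TPR} = 1 - \Phi\big((\tau - \mu_1)/\sigma_1\big)$. The function name and all numeric settings below are hypothetical. Using the same $\sigma$ under both hypotheses is a simplification.

```python
from statistics import NormalDist


def tpr_at_fpr(eps, mu0, sigma0, mu1, sigma1):
    """TPR at FPR = eps when unwatermarked scores ~ N(mu0, sigma0^2)
    and watermarked scores ~ N(mu1, sigma1^2); larger scores mean 'watermarked'."""
    # Threshold exceeded by exactly a fraction eps of unwatermarked scores.
    tau = NormalDist(mu0, sigma0).inv_cdf(1 - eps)
    # Fraction of watermarked scores above that same threshold.
    return 1 - NormalDist(mu1, sigma1).cdf(tau)


# Hypothetical setting: mean of 200 tokens x 3 layers of Bernoulli(0.5) g-values under H0.
sigma = 0.5 / (200 * 3) ** 0.5
for mu1 in (0.52, 0.55, 0.60):  # hypothetical watermarked mean scores
    print(mu1, round(tpr_at_fpr(0.01, 0.5, sigma, mu1, sigma), 3))
```

At a fixed FPR, the TPR depends only on how far the watermarked mean sits above the threshold, in units of $\sigma_1$. Any effect that pulls $\mu_1$ toward $\mu_0$ therefore lowers detection power.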
