An Unforgeable Publicly Verifiable Watermark for Large Language Models

Aiwei Liu; Leyi Pan; Xuming Hu; Shu'ang Li; Lijie Wen; Irwin King; Philip S. Yu

TL;DR

This work tackles publicly verifiable text watermarking for large language models by introducing UPV, a framework that splits watermark generation and detection across two neural networks while sharing token embeddings to keep detector training efficient. UPV enables public detection without exposing the watermark generation key, and it argues for unforgeability from a computational asymmetry: recovering the generator from the public detector is far harder than training the detector from the generator. Empirical results on GPT-2, OPT, and LLaMA-7B across multiple datasets show near-baseline detection performance with minimal false positives and negligible impact on text quality or decoding speed. The approach offers a practical, secure option for detecting machine-generated text at scale with a publicly accessible detector.

Abstract

Recently, text watermarking algorithms for large language models (LLMs) have been proposed to mitigate the potential harms of text generated by LLMs, including fake news and copyright issues. However, current watermark detection algorithms require the secret key used in the watermark generation process, making them susceptible to security breaches and counterfeiting during public detection. To address this limitation, we propose an unforgeable publicly verifiable watermark algorithm named UPV that uses two different neural networks for watermark generation and detection, instead of using the same key at both stages. Meanwhile, the token embedding parameters are shared between the generation and detection networks, which lets the detection network reach high accuracy very efficiently. Experiments demonstrate that our algorithm attains high detection accuracy and computational efficiency through neural networks. Subsequent analysis confirms the high complexity involved in forging the watermark from the detection network. Our code is available at https://github.com/THU-BPM/unforgeable_watermark. Additionally, our algorithm can also be accessed through MarkLLM (Pan et al., 2024): https://github.com/THU-BPM/MarkLLM.
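
The generation side described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the vocabulary size, embedding width, window size, the value of delta, and the single random linear layer standing in for the generator network are all assumptions chosen for illustration. The sketch only shows the idea of labeling a candidate token from a window of shared token embeddings and raising the logits of tokens labeled as watermarked.

```python
import random

VOCAB_SIZE = 50   # toy vocabulary size (assumption)
EMBED_DIM = 8     # toy embedding width (assumption)
WINDOW = 3        # tokens seen by the generator, candidate included (assumption)
DELTA = 2.0       # logit boost for watermarked tokens (assumption)

rng = random.Random(0)

# Token embeddings; in UPV these parameters are shared with the detector.
shared_embeddings = [
    [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)
]

# Hypothetical stand-in for the private generator network: one linear layer.
generator_weights = [rng.gauss(0.0, 1.0) for _ in range(WINDOW * EMBED_DIM)]


def is_watermarked(window_tokens):
    """Label the last token of the window as watermarked (True) or not."""
    features = [x for tok in window_tokens for x in shared_embeddings[tok]]
    score = sum(w * x for w, x in zip(generator_weights, features))
    return score > 0.0


def bias_logits(context, logits):
    """Add DELTA to each candidate token the generator labels as watermarked.

    Assumes the context holds at least WINDOW - 1 tokens.
    """
    prefix = list(context[-(WINDOW - 1):])
    return [
        logit + DELTA if is_watermarked(prefix + [candidate]) else logit
        for candidate, logit in enumerate(logits)
    ]


biased = bias_logits([1, 2], [0.0] * VOCAB_SIZE)
```

In this sketch only the holder of `generator_weights` can compute the labels directly; a public detector would have to be a separate network trained to recognize watermarked text, reusing `shared_embeddings`.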

Paper Structure (28 sections, 13 equations, 5 figures, 8 tables, 1 algorithm)

Introduction
Related work
Problem definition
Proposed Method
Watermarked Large Language Model
Watermark Generation Network
Watermark Detection
Watermark Detection Network
Analysis of the Unforgeability
Experiment
Experiment Setup
Main Results
Analysis of Shared Embedding
Unforgeability Analysis
Hyper-parameter and Error Analysis
...and 13 more sections

Figures (5)

Figure 1: The left segment outlines the token logits generation process of the watermarked language model. Initially, the original model generates token logits; these are then adjusted by the watermark generator to increase the probability of watermarked tokens (shown in green). Operating within a predefined token window, the generator network (center) determines the watermark label of the final token in the window. The watermark detector (right) evaluates the entire text to determine whether a watermark is present.
Figure 2: The left figure shows the detection success rate of our unforgeable publicly verifiable watermark and the success rate of two attack algorithms on the watermark under different window sizes. The right figure shows how watermark detection F1 score and generated text quality (measured by text perplexity) change as $\delta$ increases.
Figure 3: The left figure is an error analysis, illustrating the detection F1 score for data within various ranges of z-scores. The right figure depicts the changes in loss and the mean proportion ($\pm$ standard deviation) of watermarked tokens generated by the watermark generator network during training.
Figure 4: The left figure depicts the relationship between the amount of training data and the achievable cracking F1 score under a reverse training setting. The right figure demonstrates the effectiveness of watermark forgery at various cracking F1 scores.
Figure 5: Variation in attack success rate with increasing data volume for window sizes of 3 and 4.
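
The z-scores mentioned in the Figure 3 caption can be illustrated with a short sketch. It assumes the standard one-proportion z-test commonly used for green-list style watermarks, with an assumed expected watermarked-token ratio `gamma` of 0.5; the paper's exact statistic and ratio are not given in this excerpt, so treat both as hypothetical.

```python
import math


def watermark_z_score(labels, gamma=0.5):
    """One-proportion z-score for the count of watermarked tokens.

    labels: one boolean per scored token, True if the token is watermarked.
    gamma:  expected watermarked fraction in unwatermarked text (assumed 0.5).
    """
    total = len(labels)
    if total == 0:
        raise ValueError("need at least one scored token")
    hits = sum(1 for label in labels if label)
    return (hits - gamma * total) / math.sqrt(total * gamma * (1.0 - gamma))


# 75 watermarked tokens out of 100 lies five standard deviations above chance.
print(watermark_z_score([True] * 75 + [False] * 25))  # 5.0
```

A text is flagged as watermarked when the score exceeds a chosen threshold; scores close to the threshold are the cases where detection errors are most likely.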
