A Certified Robust Watermark For Large Language Models

Xianheng Feng; Jian Liu; Kui Ren; Chun Chen

TL;DR

This work proposes the first certified robust watermark algorithm for large language models. The algorithm is based on randomized smoothing and provides provable guarantees for watermarked text. It performs comparably to baseline algorithms while also deriving substantial certified robustness.

Abstract

The effectiveness of watermark algorithms in identifying AI-generated text has garnered significant attention. Concurrently, an increasing number of watermark algorithms have been proposed to improve robustness against various watermark attacks. However, these algorithms remain susceptible to adaptive or unseen attacks. To address this issue, we propose what is, to the best of our knowledge, the first certified robust watermark algorithm for large language models. It is based on randomized smoothing and provides provable guarantees for watermarked text. Specifically, we use two different models for watermark generation and detection. During both the training and inference stages of the watermark detector, we add Gaussian noise in the embedding space and Uniform noise in the permutation space, which enhances the detector's certified robustness and lets us derive a certified radius. We conducted comprehensive experiments to evaluate both the empirical and the certified robustness of our watermark algorithm. The results indicate that it performs comparably to baseline algorithms while deriving substantial certified robustness, which means that our watermark cannot be removed even under significant alterations.
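
As a rough illustration of how randomized smoothing can yield a certified radius, the sketch below smooths a detector with Gaussian noise in an embedding space and derives an $l_2$ radius from the vote margin. It is a minimal sketch under stated assumptions, not the authors' implementation: `toy_detector` and `smoothed_detect` are hypothetical names, the detector is a trivial stand-in for the trained detection model, the radius formula is the standard binary-case Gaussian smoothing bound, and a Hoeffding bound replaces the Clopper-Pearson bound that is more commonly used. The Uniform noise in the permutation space is not covered here.

```python
import math
import random
from statistics import NormalDist


def toy_detector(embedding):
    """Hypothetical stand-in for a watermark detector: 1 = watermarked, 0 = not."""
    return 1 if sum(embedding) > 0.0 else 0


def smoothed_detect(embedding, sigma=0.5, n=2000, alpha=0.001, seed=0):
    """Majority vote of the detector under Gaussian noise, plus an l2 radius.

    Returns (label, radius); label is None when the vote is too close to certify.
    """
    rng = random.Random(seed)
    votes = 0
    for _ in range(n):
        # Perturb every coordinate with independent N(0, sigma^2) noise.
        noisy = [x + rng.gauss(0.0, sigma) for x in embedding]
        votes += toy_detector(noisy)
    top_label = 1 if votes * 2 >= n else 0
    top_count = votes if top_label == 1 else n - votes
    # One-sided Hoeffding lower bound on the top-class probability
    # (assumption: a simple substitute for a Clopper-Pearson bound).
    p_lower = top_count / n - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    if p_lower <= 0.5:
        return None, 0.0  # abstain: no certificate at this confidence level
    # Binary-case Gaussian smoothing radius: sigma * Phi^{-1}(p_lower).
    radius = sigma * NormalDist().inv_cdf(p_lower)
    return top_label, radius


label, radius = smoothed_detect([1.0] * 16)
```

Under these assumptions, any perturbation of the input embedding with $l_2$ norm below `radius` leaves the smoothed decision unchanged, with probability at least `1 - alpha` over the sampling.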

Paper Structure (20 sections, 6 equations, 7 figures, 4 tables, 5 algorithms)

Introduction
Preliminaries
Watermark Algorithm
Watermark Attacks
Certifiably Robust Watermark Detector
Proposed Method
Randomized Smoothing
Training Stage of Watermark Detector
Encode Strategy
Green Token Selection
The Framework
Experiment
Experiment Setup
Experiment Results
Training Result
...and 5 more sections

Figures (7)

Figure 1: An overview of our certified robust watermarking algorithm. We use two different neural networks for watermark generation and detection. By adding Gaussian and Uniform noise during both the training and inference stages, we improve the certified robustness of the watermark algorithm and can provide provable guarantees for watermarked text.
Figure 2: The perturbation in the embedding and permutation spaces under different text attacks.
Figure 3: Certified accuracy under different noise parameter settings. (a)(b) and (c)(d) show the certified accuracy over the embedding and permutation spaces, respectively.
Figure 4: The true positive rate and true negative rate under each combination of noise parameters.
Figure 5: (a) The PDF and CDF of embeddings' $l_2$ norm of all tokens. (b) The PDF and CDF of embeddings' $l_2$ norm distance between each token.
...and 2 more figures
