A Certified Robust Watermark For Large Language Models

Xianheng Feng, Jian Liu, Kui Ren, Chun Chen

TL;DR

This work proposes the first certified robust watermark algorithm for large language models. Built on randomized smoothing, it provides provable guarantees for watermarked text, performs comparably to baseline algorithms, and achieves substantial certified robustness.

Abstract

The effectiveness of watermark algorithms in identifying AI-generated text has garnered significant attention. Concurrently, a growing number of watermark algorithms have been proposed to improve robustness against various watermark attacks. However, these algorithms remain susceptible to adaptive or unseen attacks. To address this issue, we propose what is, to the best of our knowledge, the first certified robust watermark algorithm for large language models based on randomized smoothing, which provides provable guarantees for watermarked text. Specifically, we use two different models for watermark generation and detection. During both the training and inference stages of the watermark detector, we add Gaussian noise in the embedding space and Uniform noise in the permutation space, which enhances the detector's certified robustness and lets us derive a certified radius. We conducted comprehensive experiments to evaluate both the empirical and the certified robustness of our algorithm. The results indicate that it performs comparably to baseline algorithms while achieving substantial certified robustness, meaning that the watermark provably survives even significant alterations, as long as they stay within the certified radius.
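
To make the idea of a certified radius concrete, the following is a minimal sketch of Gaussian randomized smoothing for a binary detector. It is not the paper's algorithm. It uses the standard $l_2$ radius $R = \sigma \, \Phi^{-1}(\underline{p})$ from the randomized-smoothing literature (Cohen et al., 2019), and it replaces the usual Clopper-Pearson bound with a simpler Hoeffding lower bound on the majority-vote probability. The names `smoothed_detect`, `certified_radius`, and `toy_detector` are hypothetical, the detector is a stand-in for a trained neural network, and the Uniform noise over the permutation space is not modeled.

```python
import math
import random
from statistics import NormalDist


def certified_radius(p_lower: float, sigma: float) -> float:
    """l2 radius sigma * Phi^{-1}(p_lower) for a binary smoothed classifier.

    Returns 0.0 (abstain) when the lower bound does not exceed 1/2.
    """
    if p_lower <= 0.5:
        return 0.0
    return sigma * NormalDist().inv_cdf(p_lower)


def smoothed_detect(detector, embedding, sigma, n=1000, alpha=0.001, seed=0):
    """Majority vote of `detector` under Gaussian noise, plus a certified radius.

    `detector` is any callable mapping a list of floats to a bool
    (an assumption for this sketch, not the paper's detector).
    """
    rng = random.Random(seed)
    votes = 0
    for _ in range(n):
        # Perturb every coordinate with independent N(0, sigma^2) noise.
        noisy = [x + rng.gauss(0.0, sigma) for x in embedding]
        votes += int(detector(noisy))
    p_hat = votes / n
    flagged = p_hat >= 0.5
    p_top = max(p_hat, 1.0 - p_hat)
    # Hoeffding lower bound on the true vote probability, valid with
    # probability at least 1 - alpha over the Monte Carlo samples.
    p_lower = p_top - math.sqrt(math.log(1.0 / alpha) / (2.0 * n))
    return flagged, certified_radius(p_lower, sigma)


def toy_detector(vec):
    """Hypothetical detector: flags a text when its features sum above zero."""
    return sum(vec) > 0.0


flagged, radius = smoothed_detect(toy_detector, [5.0, 5.0, 5.0, 5.0], sigma=0.5)
print(flagged, round(radius, 3))
```

Under these assumptions, the returned radius bounds the $l_2$ perturbation of the input embedding for which the smoothed decision cannot flip. A larger `sigma` allows a larger radius, but it also makes the base detector's job harder, which is presumably why the paper reports results across several noise settings.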

Paper Structure

This paper contains 20 sections, 6 equations, 7 figures, 4 tables, 5 algorithms.

Figures (7)

  • Figure 1: An overview of our certified robust watermarking algorithm. We use two different neural networks for watermark generation and detection. By adding Gaussian and Uniform noise during both the training and inference stages, we improve the certified robustness of the watermark algorithm and can provide provable guarantees for watermarked text.
  • Figure 2: Perturbations in the embedding and permutation spaces under different text attacks.
  • Figure 3: Certified accuracy under different noise parameter settings. Panels (a)(b) and (c)(d) show the certified accuracy over the embedding space and the permutation space, respectively.
  • Figure 4: The true positive rate and true negative rate under each combination of noise parameters.
  • Figure 5: (a) The PDF and CDF of the $l_2$ norm of all token embeddings. (b) The PDF and CDF of the $l_2$ distance between token embeddings.
  • ...and 2 more figures