Towards Better Statistical Understanding of Watermarking LLMs

Zhongze Cai, Shang Liu, Hanzhao Wang, Huaiyang Zhong, Xiaocheng Li

TL;DR

An online dual gradient ascent watermarking algorithm is developed and proved to be asymptotically Pareto optimal between model distortion and detection ability. This result explicitly guarantees an increase in the average green list probability, and hence in detection ability.

Abstract

In this paper, we study the problem of watermarking large language models (LLMs). We consider the trade-off between model distortion and detection ability and formulate it as a constrained optimization problem based on the green-red algorithm of Kirchenbauer et al. (2023a). We show that the optimal solution to the optimization problem enjoys a nice analytical property, which provides a better understanding of the watermarking process and inspires the algorithm design. We develop an online dual gradient ascent watermarking algorithm in light of this optimization formulation and prove its asymptotic Pareto optimality between model distortion and detection ability. Such a result explicitly guarantees an increase in the average green list probability, and hence in detection ability (in contrast to previous results). Moreover, we provide a systematic discussion on the choice of model distortion metrics for the watermarking problem. We justify our choice of KL divergence and present issues with the existing criteria of "distortion-free" and perplexity. Finally, we empirically evaluate our algorithms on extensive datasets against benchmark algorithms.
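To make the idea of an online dual update concrete, the following is a minimal toy sketch, not the paper's algorithm as stated. It assumes the watermark tilts green-list probabilities by a factor `exp(lam)`, that the constraint is a target average increase in green list probability (`target_dg`), and that the dual variable is updated by projected gradient ascent with step size `eta`. All function names, parameter values, and the randomly drawn green lists are hypothetical choices for illustration.

```python
import math
import random


def tilt_green(p, green, lam):
    """Return q with q_k proportional to p_k * exp(lam) on green tokens (assumed form)."""
    w = [pk * math.exp(lam) if k in green else pk for k, pk in enumerate(p)]
    z = sum(w)
    return [wk / z for wk in w]


def kl(q, p):
    """KL(q || p) for two distributions with the same strictly positive support."""
    return sum(qk * math.log(qk / pk) for qk, pk in zip(q, p) if qk > 0)


def green_mass(dist, green):
    """Total probability that dist assigns to the green list."""
    return sum(dist[k] for k in green)


def toy_dists(steps, vocab, seed=1):
    """Random next-token distributions standing in for an LM (illustration only)."""
    rng = random.Random(seed)
    dists = []
    for _ in range(steps):
        w = [rng.random() + 1e-3 for _ in range(vocab)]
        s = sum(w)
        dists.append([wk / s for wk in w])
    return dists


def dual_ascent_watermark(dists, gamma=0.5, target_dg=0.2, eta=0.5, seed=0):
    """Toy online dual ascent: raise lam when the green-probability gain is below target."""
    rng = random.Random(seed)  # stands in for a keyed green-list generator
    lam = 0.0
    dgs, kls = [], []
    for p in dists:
        vocab = len(p)
        green = set(rng.sample(range(vocab), max(1, int(gamma * vocab))))
        q = tilt_green(p, green, lam)
        dg = green_mass(q, green) - green_mass(p, green)  # gain in green probability
        dgs.append(dg)
        kls.append(kl(q, p))  # distortion paid at this step
        # Projected dual step (assumed rule): keep lam nonnegative.
        lam = max(0.0, lam + eta * (target_dg - dg))
    return sum(dgs) / len(dgs), sum(kls) / len(kls), lam


avg_dg, avg_kl, final_lam = dual_ascent_watermark(toy_dists(200, 20))
print(f"average green gain={avg_dg:.3f}, average KL={avg_kl:.3f}, final lam={final_lam:.3f}")
```

In this sketch the dual variable acts as a single logit boost shared by all green tokens, so steps where the model is nearly deterministic contribute little distortion, while the running update steers the average green-probability gain toward the target.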

Paper Structure (50 sections, 15 theorems, 107 equations, 9 figures, 2 tables, 3 algorithms)

Key Result

Proposition 2.3

For an LM $\bm{p}$, prompt $x \in \mathcal{X}$ and the watermarked LM $\bm{q}$ watermarked by algorithm $\mathcal{A}$ and key $\mathcal{K}$, the following decomposition holds

Figures (9)

  • Figure 1: Scatter plot of $z$-score vs. realized DG for different algorithms. SRL stands for the algorithm of Kirchenbauer et al. (2023a), and DualGA stands for our DualGA algorithm under different parameter combinations. Each point represents one generated sequence, and for each algorithm, 200 sequences are generated.
  • Figure 2: Detection ability (Realized DG) and distortion (KL) on the LFQA dataset at the population level (left) and the individual prompt level (right). Left: On the population level, the performances of some SRL and all DualGA configurations stay on the Pareto-optimal curve, illustrating an effective balance between detection ability and distortion. Exceptions notably falling below the curve include specific configurations in SRL ($\gamma=0.7,\delta=5$ and $\gamma=0.7,\delta=10$). Right: The DualGA algorithm demonstrates a consistent ability to ensure uniform detection across individual prompts.
  • Figure 3: The median $p$-values under different proportions of deleted tokens (Attack Rate) on the C4 dataset. The black dashed line represents $p=10^{-4}$.
  • Figure 4: Left: the repeated chunks of tokens generated by DualGA. Right: the abnormal, consistent rise in $\lambda_t$.
  • Figure 5: Detection ability (Realized DG) and distortion (KL) on the C4 dataset at the population level (left) and the individual prompt level (right).
  • ...and 4 more figures

Theorems & Definitions (37)

  • Definition 2.1
  • Definition 2.2
  • Proposition 2.3
  • Proposition 2.4
  • Proposition 2.5
  • Lemma 2.6
  • Theorem 2.8
  • Proposition 3.1
  • Definition 3.2
  • Proposition 3.3
  • ...and 27 more