Watermarking Diffusion Language Models

Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev

TL;DR

This work introduces the first watermark tailored for diffusion language models (DLMs), an emergent LLM paradigm able to generate tokens in arbitrary order, in contrast to standard autoregressive language models (ARLMs) which generate tokens sequentially.

Abstract

We introduce the first watermark tailored for diffusion language models (DLMs), an emergent LLM paradigm able to generate tokens in arbitrary order, in contrast to standard autoregressive language models (ARLMs) which generate tokens sequentially. While there has been much work in ARLM watermarking, a key challenge when attempting to apply these schemes directly to the DLM setting is that they rely on previously generated tokens, which are not always available with DLM generation. In this work we address this challenge by: (i) applying the watermark in expectation over the context even when some context tokens are yet to be determined, and (ii) promoting tokens which increase the watermark strength when used as context for other tokens. This is accomplished while keeping the watermark detector unchanged. Our experimental evaluation demonstrates that the DLM watermark leads to a >99% true positive rate with minimal quality impact and achieves similar robustness to existing ARLM watermarks, enabling for the first time reliable DLM watermarking.
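
Point (i) of the abstract can be illustrated with a minimal sketch. It assumes a Red-Green style watermark in which a keyed hash of the context token decides whether a candidate token is "green"; that scheme, the toy vocabulary size, and every name below (`is_green`, `expected_green`, `watermark_position`, `GAMMA`, the demo key) are illustrative assumptions and not the authors' implementation. When the context token is still undetermined, the boost applied to a candidate is its greenness averaged over the model's current distribution for the context position. Point (ii) is not covered here.

```python
import hashlib
import math

VOCAB = 8      # toy vocabulary size (assumption)
GAMMA = 0.5    # target fraction of green tokens per context (assumption)


def is_green(context_token, token, key=b"demo-key"):
    """Hypothetical keyed hash: is `token` green given `context_token`?"""
    digest = hashlib.sha256(key + bytes([context_token, token])).digest()
    return digest[0] < 256 * GAMMA


def expected_green(p_context, token):
    """Probability that `token` is green, averaged over an uncertain context."""
    return sum(pc * is_green(c, token) for c, pc in enumerate(p_context))


def watermark_position(p_token, p_context, delta):
    """Tilt p_token toward tokens that are green in expectation over the context."""
    weights = [
        p * math.exp(delta * expected_green(p_context, v))
        for v, p in enumerate(p_token)
    ]
    total = sum(weights)
    return [w / total for w in weights]


uniform = [1.0 / VOCAB] * VOCAB
q = watermark_position(uniform, uniform, delta=4.0)
```

If the context distribution is one-hot (the context token is already decoded), `expected_green` reduces to the ordinary 0/1 green check, so in this sketch the tilt coincides with a standard context-hashed boost.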

Paper Structure

This paper contains 103 sections, 3 theorems, 52 equations, 27 figures, 4 tables, 3 algorithms.

Key Result

Theorem 3.1

Given $p \in \Delta(\Sigma)^L$ and $J$ defined in Eq. \ref{eq:energy_function}, there exists $\delta \in \mathbb{R}^L$ such that $q^*_t \propto p_t \exp(\delta_t \alpha_t(q^*))$ with $\alpha_t(q) = \nabla_{q_t} J(q)$. Moreover, for all $t \in \{1,\ldots,L\}$, $\delta_t$ is the unique solution to $\text{KL}(q^*_t,p_t) = \varepsilon$.
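
The condition $\text{KL}(q^*_t,p_t) = \varepsilon$ can be made concrete with a small numerical sketch for a single position. It assumes the tilted form $q_t \propto p_t \exp(\delta_t \alpha_t)$ and treats $\alpha_t$ as a fixed score vector, whereas in the theorem $\alpha_t$ depends on $q$ itself; the bisection solver and the names `tilt`, `kl`, and `solve_delta` are illustrative choices, not the authors' procedure. Bisection applies because the KL divergence of the tilted distribution from $p_t$ is zero at $\delta_t = 0$ and non-decreasing for $\delta_t \ge 0$.

```python
import math


def tilt(p, alpha, delta):
    """Return q with q[i] proportional to p[i] * exp(delta * alpha[i])."""
    top = max(alpha)  # shift scores so every exponent is <= 0 for delta >= 0
    weights = [pi * math.exp(delta * (a - top)) for pi, a in zip(p, alpha)]
    total = sum(weights)
    return [w / total for w in weights]


def kl(q, p):
    """KL(q, p) in nats; terms with q[i] == 0 contribute nothing."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)


def solve_delta(p, alpha, eps, max_delta=1e6, iters=200):
    """Find delta >= 0 such that KL(tilt(p, alpha, delta), p) is approximately eps."""
    hi = 1.0
    while kl(tilt(p, alpha, hi), p) < eps:  # grow the bracket until KL >= eps
        hi *= 2.0
        if hi > max_delta:
            raise ValueError("eps is not reachable for this p and alpha")
    lo = 0.0
    for _ in range(iters):  # plain bisection on the monotone KL curve
        mid = 0.5 * (lo + hi)
        if kl(tilt(p, alpha, mid), p) < eps:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


p_t = [0.5, 0.3, 0.2]
alpha_t = [0.0, 1.0, 1.0]  # toy scores: the last two tokens are favored
delta_t = solve_delta(p_t, alpha_t, eps=0.1)
q_t = tilt(p_t, alpha_t, delta_t)
```

In this toy setting the reachable KL is bounded by $-\log$ of the mass that $p_t$ puts on the highest-scoring tokens, which is why the solver raises an error when the requested $\varepsilon$ exceeds that bound.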

Figures (27)

  • Figure 1: An overview of why current watermarks for ARLMs fall short in the diffusion setting (left), how our watermark operates in this setting (middle) and how our watermark detector works (right).
  • Figure 2: Detection Performance of Our Approach. (Left) We compare the trade-off between watermark detectability (TPR@1) and text quality (log PPL) of our approach and the baseline for different values of the watermark strength parameter $\delta$ and sequences of, on average, $275$ tokens. (Right) For $\delta = 4$, we compare watermark detectability (TPR@1) between our approach and the baseline as a function of text length. Responses are generated by Llada-8B with temperature $0.5$ and $600$ prompts from WaterBench. Crosses represent shared parameters between both figures.
  • Figure 3: Robustness Evaluation of Our Watermark. (Left) We measure the detectability of our watermark (TPR@1) against an increasing percentage of local modifications, using responses generated from Llada-8B with an average length of $275$ tokens. (Right) For stronger adversaries, we measure the detectability of our watermark (TPR@1) with respect to the length of the sequence. For both figures, we use $\delta = 4$ and the previous token as context ($\mathcal{C}=\{-1\}$).
  • Figure 4: Ablation of Our Watermark Components. We compare the trade-off between watermark detectability (TPR@1) and text quality (log PPL) of our approach with various hyperparameters, namely the hashing scheme (Top Left), the two components introduced in \ref{ssec:method:interpretation} (Top Right), the number of fixed-point iterations (Bottom Left) and the $\varepsilon$/$\delta$-parameterization explained in \ref{ssec:diffusion_lm_wm_instantiation} (Bottom Right). Responses are generated by Llada-8B with temperature $0.5$ and $600$ prompts.
  • Figure 5: Detection Performance Comparison with Order-Agnostic Watermarks. We study the trade-off between detectability (TPR@1) and text quality (log PPL) of our approach and order-agnostic watermarks for different values of the watermark strength parameter $\delta$ and sequences of, on average, $275$ tokens. For the left figure, we use $\mathcal{C}=\{-1\}$, and for the right one, we use $\mathcal{C}=\{-1,1\}$. For the order-agnostic watermarks, we use the same data for both figures.
  • ...and 22 more figures

Theorems & Definitions (6)

  • Theorem 3.1
  • proof
  • Theorem G.1
  • proof
  • Theorem J.1
  • proof