Token-Specific Watermarking with Enhanced Detectability and Semantic Coherence for Large Language Models

Mingjia Huo, Sai Ashish Somayajula, Youwei Liang, Ruisi Zhang, Farinaz Koushanfar, Pengtao Xie

TL;DR

The paper tackles the challenge of identifying AI-generated text by introducing token-specific watermarking for large language models, in which two lightweight generators produce a per-token splitting ratio $\gamma_t$ and watermark logit $\delta_t$. It optimizes detectability and semantic coherence in a multi-objective framework using MGDA, with differentiable surrogates for watermark detectability and a SimCSE-based semantic loss. The approach yields a Pareto-optimal trade-off that outperforms prior methods (e.g., KGW, SWEET, SIR, EXP-edit), achieving higher watermark detectability while preserving semantic quality, and it demonstrates robustness to paraphrase and copy-paste attacks. The work offers practical benefits for AI-output provenance and policy enforcement; code is available at https://github.com/mignonjia/TS_watermark.

Abstract

Large language models generate high-quality responses with potential misinformation, underscoring the need for regulation by distinguishing AI-generated and human-written texts. Watermarking is pivotal in this context, which involves embedding hidden markers in texts during the LLM inference phase, which is imperceptible to humans. Achieving both the detectability of inserted watermarks and the semantic quality of generated texts is challenging. While current watermarking algorithms have made promising progress in this direction, there remains significant scope for improvement. To address these challenges, we introduce a novel multi-objective optimization (MOO) approach for watermarking that utilizes lightweight networks to generate token-specific watermarking logits and splitting ratios. By leveraging MOO to optimize for both detection and semantic objective functions, our method simultaneously achieves detectability and semantic integrity. Experimental results show that our method outperforms current watermarking techniques in enhancing the detectability of texts generated by LLMs while maintaining their semantic coherence. Our code is available at https://github.com/mignonjia/TS_watermark.
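
As a rough illustration of what "token-specific watermarking logits and splitting ratios" could look like at inference time, the sketch below applies a KGW-style green-list bias with per-token values. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `watermark_logits`, the `key` parameter, and the seeding scheme are hypothetical, and `gamma_t` and `delta_t` are passed in as plain numbers standing in for the outputs of the two lightweight generators.

```python
import random


def watermark_logits(logits, prev_token, gamma_t, delta_t, key=0):
    """Bias a pseudo-random 'green' subset of the vocabulary by delta_t.

    gamma_t: fraction of the vocabulary placed in the green list (splitting ratio).
    delta_t: value added to the logit of every green token (watermark logit).
    Both are placeholders for the per-token generator outputs.
    """
    vocab_size = len(logits)
    # Hypothetical seeding: derive the split from the preceding token and a
    # secret key so that a detector holding the key can rebuild the same list.
    rng = random.Random(key * 1_000_003 + prev_token)
    green_size = int(gamma_t * vocab_size)
    green = set(rng.sample(range(vocab_size), green_size))
    biased = [x + delta_t if i in green else x for i, x in enumerate(logits)]
    return biased, green


# Toy usage with an 8-token vocabulary and flat logits.
logits = [0.0] * 8
biased, green = watermark_logits(logits, prev_token=42, gamma_t=0.25, delta_t=2.0)
```

Sampling from the softmax of `biased` instead of `logits` makes green tokens more likely; a larger `delta_t` strengthens the signal at a greater cost to the original distribution, which is the trade-off the two objectives are meant to balance.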

Paper Structure

This paper contains 45 sections, 1 theorem, 12 equations, 13 figures, 5 tables.

Key Result

Theorem 4.1

Consider $T$ independent Bernoulli random variables $X_1,\ldots,X_T$, each with means $\mu_1,\ldots,\mu_T$, $0 < \mu_t < 1$ $\forall t \in \{1, \ldots, T\}$. The sum of these variables, $\sum_{t=1}^T X_t$, follows a Poisson binomial distribution. When $T$ is sufficiently large, this distribution can be
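
The theorem statement is cut off in this excerpt. The sketch below does not reconstruct it; it only illustrates a standard fact about Poisson binomial sums, namely that for large $T$ they are close to a normal distribution with mean $\sum_t \mu_t$ and variance $\sum_t \mu_t(1-\mu_t)$. Treating that as the relevant approximation here is an assumption, and the means in `mus` are arbitrary hypothetical values.

```python
import math


def poisson_binomial_pmf(mus):
    """Exact PMF of a sum of independent Bernoulli(mu_t), by dynamic programming."""
    pmf = [1.0]  # distribution of the empty sum: P(0) = 1
    for mu in mus:
        nxt = [0.0] * (len(pmf) + 1)
        for k, p in enumerate(pmf):
            nxt[k] += p * (1.0 - mu)   # X_t = 0
            nxt[k + 1] += p * mu       # X_t = 1
        pmf = nxt
    return pmf


def normal_cdf(x, mean, var):
    return 0.5 * (1.0 + math.erf((x - mean) / math.sqrt(2.0 * var)))


# Hypothetical, non-identical means in (0, 1) for T = 200 variables.
mus = [0.2 + 0.6 * ((7 * t) % 10) / 10 for t in range(200)]
pmf = poisson_binomial_pmf(mus)
mean = sum(mus)
var = sum(m * (1.0 - m) for m in mus)

# Largest gap between the exact CDF and the normal CDF (continuity-corrected).
cdf, max_gap = 0.0, 0.0
for k, p in enumerate(pmf):
    cdf += p
    max_gap = max(max_gap, abs(cdf - normal_cdf(k + 0.5, mean, var)))
print(f"mean={mean:.2f} var={var:.2f} max CDF gap={max_gap:.4f}")
```

In a watermarking setting, an approximation of this kind is what allows a count of green tokens to be turned into a z-score even when the per-token probabilities differ.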

Figures (13)

  • Figure 1: The training procedure is as follows: During LLM text generation, we utilize the $\gamma$-generator and $\delta$-generator to modify the probability of each token before sampling the next one. The parameters of these networks are learned through optimization of the detection loss and the semantic loss within a multi-objective optimization framework.
  • Figure 2: Comparison of the trade-off for semantic integrity and detectability of different methods applied to OPT-1.3B.
  • Figure 3: Performance of Ours (trained on OPT-1.3B) and KGW when applied to LLAMA2 7B, 13B, and 70B.
  • Figure 4: Distribution of watermark logit $\delta$ (left y-axis) and splitting ratio $\gamma$ (right y-axis) across different part-of-speech categories of the preceding token.
  • Figure 5: Comparison of our method with KGW under the Dipper paraphrase attack.
  • ...and 8 more figures
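
The Figure 1 caption describes learning the generator parameters from two losses inside a multi-objective framework, which the TL;DR names as MGDA. For two objectives, MGDA's min-norm problem has a well-known closed form, sketched below on plain Python lists. The function name `mgda_two_task` and the toy gradients are hypothetical, and how the authors wire this into training is not shown in this excerpt.

```python
def mgda_two_task(g1, g2):
    """Return (alpha, d), where d = alpha*g1 + (1-alpha)*g2 has minimum norm.

    g1, g2: gradients of the two losses (e.g. detection and semantic) with
    respect to the shared parameters, flattened into equal-length lists.
    """
    diff = [a - b for a, b in zip(g1, g2)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:
        alpha = 0.5  # identical gradients: any weighting gives the same direction
    else:
        # Unconstrained minimizer of ||alpha*g1 + (1-alpha)*g2||^2, clipped to [0, 1].
        alpha = sum((b - a) * b for a, b in zip(g1, g2)) / denom
        alpha = min(1.0, max(0.0, alpha))
    direction = [alpha * a + (1.0 - alpha) * b for a, b in zip(g1, g2)]
    return alpha, direction


# Orthogonal gradients of equal length are weighted equally.
alpha, direction = mgda_two_task([1.0, 0.0], [0.0, 1.0])
# When one gradient is a shorter copy of the other, the shorter one is chosen.
alpha_clipped, direction_clipped = mgda_two_task([1.0, 0.0], [2.0, 0.0])
```

Stepping along the negative of `direction` does not increase either loss to first order, which is why this weighting is used to move toward Pareto-optimal solutions rather than fixing the loss weights by hand.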

Theorems & Definitions (2)

  • Theorem 4.1
  • Definition 4.2