Adaptive Text Watermark for Large Language Models

Yepeng Liu, Yuheng Bu

TL;DR

The paper tackles AI-generated text misuse by proposing an adaptive watermarking framework for LLMs that preserves text quality while remaining robust to paraphrasing and difficult to forge. It combines Adaptive Watermark Token Identification (AWTI), Semantic-based Logits Scaling Vector Extraction (SLSVE), and Adaptive Watermark Temperature Scaling (AWTS) to embed watermarks only at high-entropy token positions and to hide the watermark pattern from attackers. Detection requires no knowledge of the prompt or the generating model; it relies on an approximate likelihood ratio test computed with the auxiliary measurement model and the semantic mapping model. Experiments across multiple models and datasets show robustness comparable to existing watermark methods, perplexity close to that of un-watermarked text, and strong security against spoofing, with ablations isolating the contribution of each component. Overall, the approach offers a practical, model-agnostic watermarking solution that balances robustness, security, and text quality.

Abstract

The advancement of Large Language Models (LLMs) has led to increasing concerns about the misuse of AI-generated text, and watermarking for LLM-generated text has emerged as a potential solution. However, it is challenging to generate high-quality watermarked text while maintaining strong security, robustness, and the ability to detect watermarks without prior knowledge of the prompt or model. This paper proposes an adaptive watermarking strategy to address this problem. To improve text quality and maintain robustness, we adaptively add watermarking to token distributions with high entropy, measured using an auxiliary model, and keep the low-entropy token distributions untouched. For the sake of security and to further minimize the watermark's impact on text quality, instead of using a fixed green/red list generated from a random secret key, which can be vulnerable to decryption and forgery, we adaptively scale up the output logits in proportion to a vector derived from the semantic embedding of previously generated text using a well-designed semantic mapping model. Our experiments involving various LLMs demonstrate that our approach achieves robustness comparable to existing watermark methods. Additionally, the text generated by our method has perplexity comparable to that of un-watermarked LLMs while maintaining security even under various attacks.
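
The entropy-gated perturbation described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact perturbation rule, so the multiplicative form `logit * (1 + delta * v)` used here is an assumption, as are the function names and the binary `scaling_vector`; in the paper that vector comes from the semantic mapping model, which is not reproduced here. The names `alpha` (entropy threshold) and `delta` (watermark strength) follow the symbols used in the figure captions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def watermark_step(logits, aux_logits, scaling_vector, alpha, delta):
    """Return the logits to sample from at one generation step.

    logits         -- next-token logits of the generating LLM
    aux_logits     -- next-token logits of the auxiliary measurement model
    scaling_vector -- hypothetical per-token vector (here 0 or 1), standing in
                      for the output of the semantic mapping model
    alpha          -- entropy threshold
    delta          -- watermark strength
    """
    # Low entropy: leave the distribution untouched to protect text quality.
    if entropy(softmax(aux_logits)) <= alpha:
        return list(logits)
    # High entropy: scale logits up in proportion to the scaling vector.
    # ASSUMPTION: multiplicative scaling; the source does not state the formula.
    return [l * (1.0 + delta * v) for l, v in zip(logits, scaling_vector)]
```

With a peaked auxiliary distribution such as `[10.0, 0.0, 0.0, 0.0]` the step returns the logits unchanged, while a uniform one triggers the scaling.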

Paper Structure

This paper contains 27 sections, 7 equations, 10 figures, 13 tables, 2 algorithms.

Figures (10)

  • Figure 1: Workflow of the proposed Adaptive Watermark. The method assesses the entropy of the next-token distribution given all previously generated tokens and adds watermarks only to high-entropy tokens. The figure illustrates two cases in the text generation process. In the first, the next-token distribution has high entropy as measured by AWTI, indicating high uncertainty. The SLSVE module then extracts the logits scaling vector from the semantics of the preceding text, and AWTS is applied to perturb the sampling distribution. In the second, AWTI measures low entropy, indicating low uncertainty about the next token, so the next token is sampled directly from the original probability distribution.
  • Figure 2: Workflow of Adaptive Detection. The figure illustrates a single step of the detection process. At each time step, AWTI first estimates the entropy of the current distribution to check whether the current token is potentially watermarked. For a potentially watermarked token, the preceding text is used to extract the logits scaling vector through SLSVE. If the entry of the logits scaling vector corresponding to that token is positive, it is added to the total score.
  • Figure 3: Comparison of text perplexity among human-written text, un-watermarked text, and text produced by various watermark methods across different language models. For KGW-0 and KGW-1, the watermark strength and green list size are set to $2.0$ and $0.5$, respectively.
  • Figure 4: Comparison of text perplexity at varying entropy thresholds (left) and across different watermark strengths (right). $\alpha$ represents the entropy threshold. $\delta$ represents the watermark strength.
  • Figure 5: Comparison of text perplexity across various watermarking methods on Mistral-7B with the C4 dataset, with perplexity calculated using GPT-3.
  • ...and 5 more figures
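
The single detection step described in the Figure 2 caption can be sketched as a loop over token positions. This is an illustration under stated assumptions, not the authors' implementation: `aux_logits_fn` and `scaling_vector_fn` are hypothetical callables standing in for the auxiliary measurement model and SLSVE, the toy versions below exist only to make the example runnable, and reading "it will be added to the total score" as adding the vector entry itself is an interpretation of the caption.

```python
import math


def entropy_from_logits(logits):
    """Shannon entropy (nats) of softmax(logits)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return -sum((e / total) * math.log(e / total) for e in exps if e > 0.0)


def detection_score(tokens, aux_logits_fn, scaling_vector_fn, alpha):
    """Return (total score, number of positions treated as watermarked).

    tokens            -- list of token ids of the text under test
    aux_logits_fn     -- hypothetical: prefix -> next-token logits (aux model)
    scaling_vector_fn -- hypothetical: prefix -> logits scaling vector (SLSVE)
    alpha             -- entropy threshold
    """
    score, considered = 0.0, 0
    for t in range(1, len(tokens)):
        prefix = tokens[:t]
        # AWTI-style gate: skip positions whose distribution has low entropy.
        if entropy_from_logits(aux_logits_fn(prefix)) <= alpha:
            continue
        considered += 1
        value = scaling_vector_fn(prefix)[tokens[t]]
        # ASSUMPTION: a positive entry contributes its own value to the score.
        if value > 0:
            score += value
    return score, considered


# Toy stand-ins (not real models), for a vocabulary of four token ids.
def toy_aux_logits(prefix):
    # Uniform (high entropy) after even-length prefixes, peaked otherwise.
    return [0.0, 0.0, 0.0, 0.0] if len(prefix) % 2 == 0 else [10.0, 0.0, 0.0, 0.0]


def toy_scaling_vector(prefix):
    # Marks tokens whose id has the same parity as the last prefix token.
    last = prefix[-1]
    return [1.0 if (tok + last) % 2 == 0 else 0.0 for tok in range(4)]


result = detection_score([0, 1, 2, 3, 1], toy_aux_logits, toy_scaling_vector, alpha=0.5)
```

Returning the count of gated positions alongside the raw score leaves any normalization or thresholding to the caller, since the caption does not specify one.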