Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation

Shizhan Cai, Liang Ding, Dacheng Tao

TL;DR

This work tackles content traceability in LLMs by introducing a cumulative watermark entropy threshold applied at test time, enabling robust detectability without sacrificing text quality. The approach integrates seamlessly with existing sampling functions, using a seed-based secret key and a learned mapping to embed watermarks only when entropy exceeds a threshold, while maintaining indistinguishability up to a negligible bound. A detection framework based on p-values and a fine-grained alignment cost provides reliable watermark verification under adaptive queries and attacks, including paraphrase. Empirical results across multiple lightweight models and long-answer QA datasets show substantial improvements over prior schemes in detectability and robustness, with minimal quality degradation and strong generalization to larger models like Llama-3.1-8B, highlighting practical impact for model accountability and misuse mitigation in real-world deployments.
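The entropy gate described above can be illustrated with a small sketch. This is one plausible reading, not the authors' algorithm: it assumes Shannon entropy in bits is accumulated over the next-token distributions and that keyed sampling begins once the running total exceeds the threshold. The names `shannon_entropy`, `inverse_cdf`, `gated_sample`, `key_stream`, and `threshold` are hypothetical and do not come from the paper.

```python
import math
import random


def shannon_entropy(probs):
    """Shannon entropy, in bits, of a discrete distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def inverse_cdf(probs, u):
    """Return the first index whose cumulative probability exceeds u."""
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if u < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def gated_sample(dists, key_stream, threshold, rng):
    """Sample one token per distribution.

    ASSUMPTION: entropy accumulates across steps, and the keyed value
    from key_stream replaces fresh randomness only after the running
    total exceeds `threshold`. The paper may define the gate differently.
    """
    tokens, flags = [], []
    cumulative = 0.0
    for probs, u_key in zip(dists, key_stream):
        cumulative += shannon_entropy(probs)
        watermark = cumulative > threshold
        u = u_key if watermark else rng.random()
        tokens.append(inverse_cdf(probs, u))
        flags.append(watermark)
    return tokens, flags


if __name__ == "__main__":
    dists = [[1.0, 0.0], [0.5, 0.5], [0.5, 0.5], [0.25] * 4]
    tokens, flags = gated_sample(dists, [0.9] * 4, 1.5, random.Random(0))
    print(tokens, flags)
```

In this toy run the per-step entropies are 0, 1, 1, and 2 bits, so with a threshold of 1.5 bits only the last two positions use the keyed value. Low-entropy positions are left to ordinary sampling, which matches the stated goal of not forcing a watermark where the model has little freedom.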

Abstract

The rapid development of Large Language Models (LLMs) has intensified concerns about content traceability and potential misuse. Existing watermarking schemes for sampled text often face trade-offs between maintaining text quality and ensuring robust detection against various attacks. To address these issues, we propose a novel watermarking scheme that improves both detectability and text quality by introducing a cumulative watermark entropy threshold. Our approach is compatible with and generalizes existing sampling functions, enhancing adaptability. Experimental results across multiple LLMs show that our scheme significantly outperforms existing methods, achieving over 80% improvements on widely used datasets, e.g., MATH and GSM8K, while maintaining high detection accuracy.

Paper Structure

This paper contains 20 sections, 23 equations, 5 figures, 4 tables, 3 algorithms.

Figures (5)

  • Figure 1: Overview of the watermarking workflow during LLM sampling. Apricot arrows represent the implantation of the secret key, while blue arrows illustrate the original LLM generation flow. The watermarking approach must balance two critical requirements: (1) text quality: the output text must retain the same quality as non-watermarked text to preserve user experience; and (2) detectability: the watermark must remain reliably detectable, even when users modify the output. Existing schemes exhibit text quality degradation and weaknesses in detectability under adversarial attacks.
  • Figure 2: Workflow of the proposed watermarking and detection algorithm. The diagram illustrates the core steps of the watermarking process, including token distribution manipulation, secret key sampling, watermark embedding, and subsequent detection. Starting with a token distribution from the LLM, a secret key ($\xi$) is sampled to produce an indistinguishable watermark ($Y$), which is embedded into the generated text ($\tilde{Y}$). The watermark remains robust against various modification attacks, such as deleting, inserting, and substituting. Detection involves testing whether the text contains the watermark, either accepting or rejecting the null hypothesis.
  • Figure 3: ROC curves for our scheme, the ITS scheme, and the Binary scheme under different attack conditions. The first row shows ROC curves for watermarked text before attacks, while the second row illustrates the impact of paraphrasing attacks on the same text. Each subplot corresponds to a specific model (Llama, OPT, Gemma, phi). Our scheme achieves high AUC values both before and after attacks, with less degradation in classification ability than the ITS and Binary schemes. In contrast, the Binary scheme shows significant vulnerability, with AUC values dropping below 0.35 post-attack, highlighting its limited robustness in adversarial scenarios.
  • Figure 4: Detectability of different sampling methods as text length increases. The plot compares the True Positive Rate (TPR) at a fixed False Positive Rate (FPR = 1%) for ITS sampling and Binary sampling, both with and without adversarial attacks. Results show that ITS sampling achieves higher detectability with shorter text lengths and maintains robustness under attacks, while Binary sampling demonstrates slower detectability growth and greater vulnerability to attacks as text length increases.
  • Figure 5: Comparison of Binary Sampling and Inverse Transform Sampling. The figure illustrates the mechanisms of the two sampling functions. Huffman Encoding (H.E.) is used in Binary Sampling to map a uniform random variable $u_1$ to a discrete binary outcome. In contrast, Inverse Transform Sampling applies a random permutation to introduce additional randomness while directly drawing $u$ from the uniform distribution $U[0,1]$.
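The Inverse Transform Sampling mechanism named in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the secret key is assumed to be a pair of a uniform value `u` and a vocabulary permutation `perm`, both derived from a seed, and the names `make_key` and `its_sample` are hypothetical.

```python
import random


def make_key(vocab_size, seed):
    """Derive a (u, permutation) pair from a seed.

    ASSUMPTION: the key material is a uniform draw from U[0,1] plus a
    random permutation of token ids; the paper's key format may differ.
    """
    rng = random.Random(seed)
    perm = list(range(vocab_size))
    rng.shuffle(perm)
    return rng.random(), perm


def its_sample(probs, u, perm):
    """Walk the CDF in permuted token order and return the token
    whose interval contains u."""
    acc = 0.0
    for token in perm:
        acc += probs[token]
        if u < acc:
            return token
    return perm[-1]  # guard against floating-point round-off


if __name__ == "__main__":
    probs = [0.2, 0.3, 0.5]
    u, perm = make_key(len(probs), seed=42)
    print(its_sample(probs, u, perm))
```

With a fixed seed the same `(u, perm)` pair is reproduced, which is what would let a detector holding the seed recompute the key. Because `u` is uniform, each token is still selected with its model probability regardless of the permutation.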

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4