Entropy-Guided Watermarking for LLMs: A Test-Time Framework for Robust and Traceable Text Generation
Shizhan Cai, Liang Ding, Dacheng Tao
TL;DR
This work tackles content traceability in LLMs by introducing a cumulative watermark entropy threshold applied at test time, enabling robust detectability without sacrificing text quality. The approach integrates with existing sampling functions, using a seed-based secret key and a learned mapping to embed watermarks only when entropy exceeds the threshold, while maintaining indistinguishability up to a negligible bound. A detection framework based on p-values and a fine-grained alignment cost provides reliable watermark verification under adaptive queries and attacks, including paraphrasing. Empirical results across multiple lightweight models and long-answer QA datasets show substantial improvements over prior schemes in detectability and robustness, with minimal quality degradation and strong generalization to larger models such as Llama-3.1-8B. These results point to practical value for model accountability and misuse mitigation in real-world deployments.
Abstract
The rapid development of Large Language Models (LLMs) has intensified concerns about content traceability and potential misuse. Existing watermarking schemes for sampled text often trade text quality against robust detection under various attacks. To address these issues, we propose a novel watermarking scheme that improves both detectability and text quality by introducing a cumulative watermark entropy threshold. Our approach is compatible with, and generalizes, existing sampling functions, which enhances its adaptability. Experimental results across multiple LLMs show that our scheme significantly outperforms existing methods, achieving improvements of over 80% on widely used datasets such as MATH and GSM8K while maintaining high detection accuracy.
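
To make the idea of a cumulative entropy threshold concrete, the sketch below shows one plausible way such a gate could work. It is an illustration under stated assumptions, not the scheme defined in the paper: the function names (`shannon_entropy`, `watermark_gate`), the use of Shannon entropy in bits, and the rule of resetting the accumulator after each embedding step are all hypothetical choices made for this example.

```python
import math


def shannon_entropy(probs):
    """Shannon entropy, in bits, of a single next-token distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0.0)


def watermark_gate(step_distributions, threshold):
    """Return one boolean per generation step: True where a watermark
    would be embedded.

    Assumption (not taken from the paper): entropy is accumulated across
    steps, the watermark is embedded once the running total reaches
    `threshold`, and the accumulator is then reset to zero.
    """
    decisions = []
    accumulated = 0.0
    for probs in step_distributions:
        accumulated += shannon_entropy(probs)
        if accumulated >= threshold:
            decisions.append(True)
            accumulated = 0.0  # hypothetical reset rule
        else:
            decisions.append(False)
    return decisions


# Toy next-token distributions; a real model would supply these.
steps = [
    [1.0, 0.0],                # deterministic step: 0 bits
    [0.5, 0.5],                # 1 bit
    [0.25, 0.25, 0.25, 0.25],  # 2 bits
    [1.0, 0.0],                # 0 bits
    [0.5, 0.5],                # 1 bit
    [0.5, 0.5],                # 1 bit
]
decisions = watermark_gate(steps, threshold=2.0)
print(decisions)
```

With this toy input, low-entropy steps (where the next token is nearly forced) are left untouched, and embedding happens only after enough uncertainty has accumulated. That is the intuition behind protecting text quality on near-deterministic content such as mathematical answers.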
