Improving Detection of Watermarked Language Models

Dara Bahri, John Wieting

TL;DR

This work investigates improving first-party detection of AI-generated content (AGC) by fusing watermark-based and non-watermark detectors into hybrid schemes. It demonstrates that hybrids, especially those using a two-sided cascade or logistic regression, outperform either approach alone across entropy levels, text lengths, and attack conditions. The study shows that watermark strength and available entropy critically influence detector performance, with RoBERTa-based detectors offering strong complementary signals in low-entropy regimes; however, attacks like paraphrasing can largely erase watermark signals. Practically, the results provide guidance for deploying efficient, robust AGC detectors in real-world settings, highlighting trade-offs between accuracy, compute, and resilience to manipulation. Overall, hybrid detection emerges as a promising path to robustly identifying LLM-generated content in diverse and adversarial contexts.

Abstract

Watermarking has recently emerged as an effective strategy for detecting the generations of large language models (LLMs). The strength of a watermark typically depends strongly on the entropy afforded by the language model and the set of input prompts. However, entropy can be quite limited in practice, especially for models that are post-trained, for example via instruction tuning or reinforcement learning from human feedback (RLHF), which makes detection based on watermarking alone challenging. In this work, we investigate whether detection can be improved by combining watermark detectors with non-watermark ones. We explore a number of hybrid schemes that combine the two, observing performance gains over either class of detector under a wide range of experimental conditions.

Paper Structure

This paper contains 30 sections, 7 equations, 6 figures, 23 tables.

Figures (6)

  • Figure 1: Detection accuracy as a function of average response entropy of prompts. The response entropy of each prompt is estimated using 4 non-watermarked generations and the prompts are partitioned based on 20% percentiles. For example, accuracy at entropy $x$ is computed on the 20% of prompts with the largest entropy that is less than or equal to $x$. Gemma-7B-instruct is applied to databricks-dolly-15k under Aaronson, Kirchenbauer, and Bahri watermarking schemes. Mistral-7B-instruct generations are taken as negatives with a target length of 100. Likelihood (LLh) and RoBERTa detectors are shown. We see that watermarking performance improves with entropy, as we expect. While LLh also improves with entropy, the RoBERTa classifier is strong and also fairly flat, which the hybrid approaches successfully leverage to significantly improve watermarking in the low entropy regime. More results are in the Appendix.
  • Figure 2: Detection accuracy of the cascades and logistic regression models as a function of the text length $T$ (in tokens) used for detection. The first $T$ tokens of each text are used and two standard error bars are shaded. Gemma-7B-instruct is applied to databricks-dolly-15k under the Aaronson scheme with human negatives. We observe that for the watermark detector and all non-watermark detectors except the RoBERTa classifier, performance improves sharply with more test tokens, as we expect. RoBERTa's strong performance at low token counts is noteworthy and likely due to the training procedure, which explicitly incorporates texts of varying lengths. We find that cascades and LR combinations boost performance over either detector fairly consistently across lengths, providing assurance to the practitioner that these combinations confer benefits no matter the length of the test text.
  • Figure 3: Detection accuracy of cascades and logistic regression as a function of the percentage of tokens corrupted. Two standard error bars are shaded. Gemma-7B-instruct is applied to databricks-dolly-15k under different watermarking schemes using the RoBERTa classifier with human negatives and 100 tokens. We observe that while watermark detectors all degrade with more corruption, the RoBERTa classifier maintains consistently strong performance and that cascades and LR offer consistent improvements over either at all corruption levels. More results are in the Appendix.
  • Figure 4: The decision boundary learned by our logistic regression (LR) model when Gemma-7B-instruct is applied to databricks-dolly-15k under the Aaronson scheme. Human responses are taken as negatives and 100 tokens are used. Our LR models are trained on top of $z$-score normalized watermark and non-watermark scores. The area $y \geq x$ represents positive predictions (i.e. the sample is from our model). $y = -x$ is shown for reference. Because the slope magnitude is less than one, we see that the learned LR consistently puts more weight on the non-watermark detector scores than on the watermark ones, and that this is more pronounced for RoBERTa.
  • Figure 5: Detection accuracy of cascades and logistic regression as a function of average response entropy of prompts. Gemma-7B-instruct is applied to databricks-dolly-15k under the Aaronson scheme. Human responses are taken as negatives and 100 tokens are used. Watermarking improves with entropy and hybrid methods show gains across the board.
  • ...and 1 more figure
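
The logistic regression hybrid described in the Figure 4 caption (an LR trained on $z$-score normalized watermark and non-watermark scores) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the scores are synthetic, the plain gradient-descent fit and the helper names `zscore` and `fit_logistic_regression` are assumptions, and the choice of the watermark score as the x-axis and the non-watermark score as the y-axis is inferred from the caption's remark about the slope.

```python
import numpy as np

def zscore(x):
    """Normalize scores to zero mean and unit variance."""
    return (x - x.mean()) / x.std()

def fit_logistic_regression(X, y, lr=0.5, steps=2000):
    """Fit weights and bias by plain gradient descent on the log loss."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(label = 1)
        w -= lr * (X.T @ (p - y)) / len(y)
        b -= lr * np.mean(p - y)
    return w, b

# Synthetic, made-up detector scores (assumption): positives are texts from
# "our" model, negatives are other texts. The non-watermark detector is made
# more discriminative than the watermark detector on purpose.
rng = np.random.default_rng(0)
n = 500
wm_scores = np.concatenate([rng.normal(1.0, 1.0, n), rng.normal(0.0, 1.0, n)])
nw_scores = np.concatenate([rng.normal(2.0, 1.0, n), rng.normal(0.0, 1.0, n)])
labels = np.concatenate([np.ones(n), np.zeros(n)])

# Column 0 (x-axis): watermark score; column 1 (y-axis): non-watermark score.
X = np.column_stack([zscore(wm_scores), zscore(nw_scores)])
w, b = fit_logistic_regression(X, labels)

# Decision boundary: w[0]*x + w[1]*y + b = 0, i.e. y = slope * x + intercept.
slope = -w[0] / w[1]
intercept = -b / w[1]
accuracy = np.mean(((X @ w + b) >= 0) == (labels == 1))

print(f"weights={w}, slope={slope:.3f}, accuracy={accuracy:.3f}")
```

In this toy setup, a slope magnitude below one corresponds to a larger weight on the non-watermark score than on the watermark score, which is the reading the caption gives for the learned boundary.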