Modification and Generated-Text Detection: Achieving Dual Detection Capabilities for the Outputs of LLM by Watermark

Yuhang Cai, Yaofei Wang, Donghui Hu, Chen Gu

TL;DR

This work targets spoofing risks in LLM watermarks by proposing dual detection: modification detection and generated-text detection. It leverages an unbiased $\delta$-reweight watermark to create inconsistent distortion, enabling IDD to flag tampering via discarded/inconsistent tokens while allowing valid watermarks to enable text-origin verification through drLLR. Empirical results on PubMedQA prompts with OPT-6.7B show IDD accurately detects modifications across addition, deletion, and replacement, and drLLR achieves high AUC for generated-text detection even under perturbations. The approach offers a practical, dual-capability watermarking framework to bolster trust and accountability in LLM deployments.

Abstract

The development of large language models (LLMs) has raised concerns about potential misuse. One practical solution is to embed a watermark in the text, allowing ownership verification through watermark extraction. Existing methods primarily focus on defending against modification attacks, often neglecting other spoofing attacks. For example, attackers can alter the watermarked text to produce harmful content without compromising the presence of the watermark, which could lead to false attribution of this malicious content to the LLM. This situation poses a serious threat to LLM service providers and highlights the importance of achieving modification detection and generated-text detection simultaneously. Therefore, we propose a technique to detect modifications in text for unbiased watermarks, which are sensitive to modification. We introduce a new metric called "discarded tokens", which measures the number of tokens not included in watermark detection. When a modification occurs, this metric changes and can serve as evidence of the modification. Additionally, we improve the watermark detection process and introduce a novel detection method for unbiased watermarks. Our experiments demonstrate that we can achieve effective dual detection capabilities by watermark: modification detection and generated-text detection.

Paper Structure

This paper contains 15 sections, 4 equations, 3 figures, 3 tables, 1 algorithm.

Figures (3)

  • Figure 1: The framework of dual detection capabilities for LLM-generated text by watermark. We analyze the discarded token(s) caused by modification, which fail to serve as evidence for watermark detection. If the number of these tokens exceeds the threshold, it confirms the existence of a modification. Meanwhile, we achieve generated-text detection using the remaining watermarked tokens in the text.
  • Figure 2: Sampling method of $\delta$-reweight and the inconsistent distortion of $\delta$-reweight caused by modified token(s). The upper part of the figure illustrates the sampling process without a modified token. The lower part shows how a modified token among the context tokens disturbs the sampling method and results in inconsistent tokens (marked in red) until no modified tokens remain among the context tokens. The unaffected tokens (in green) are still consistent with the sampled tokens and serve as evidence for the detection result.
  • Figure 3: Distribution of the number of discarded tokens (tokens in the red list for KGW and inconsistent tokens for $\delta$-reweight) in text under different attacks.
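
The idea in Figures 1 and 2 (count tokens that are inconsistent with context-seeded sampling, then compare the count to a threshold) can be illustrated with a toy sketch. This is not the paper's implementation: the hash-based `expected_token` function stands in for $\delta$-reweight sampling over a real LLM distribution, and `VOCAB_SIZE`, `WINDOW`, `KEY`, and the default threshold are all hypothetical values chosen for illustration.

```python
import hashlib

# All constants below are illustrative assumptions, not values from the paper.
VOCAB_SIZE = 50      # toy vocabulary size
WINDOW = 2           # number of context tokens that seed each sampling step
KEY = b"secret-key"  # hypothetical watermark key


def expected_token(context, key=KEY):
    """Deterministically pick a token from the key and the context tokens.

    Stand-in for delta-reweight sampling: a real system would use the
    seeded randomness to sample from the LLM's next-token distribution.
    """
    data = key + b"|" + ",".join(str(t) for t in context).encode()
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") % VOCAB_SIZE


def generate(prefix, n):
    """Append n watermarked tokens to a prefix of at least WINDOW tokens."""
    tokens = list(prefix)
    for _ in range(n):
        tokens.append(expected_token(tokens[-WINDOW:]))
    return tokens


def count_discarded(tokens):
    """Count tokens that disagree with what the seeded sampler would emit."""
    count = 0
    for i in range(WINDOW, len(tokens)):
        if tokens[i] != expected_token(tokens[i - WINDOW:i]):
            count += 1
    return count


def is_modified(tokens, threshold=0):
    """Flag a modification when discarded tokens exceed the threshold."""
    return count_discarded(tokens) > threshold


if __name__ == "__main__":
    text = generate([1, 2], 30)
    tampered = list(text)
    tampered[10] = (tampered[10] + 1) % VOCAB_SIZE  # replacement attack
    print(count_discarded(text), is_modified(text))
    print(count_discarded(tampered), is_modified(tampered))
```

In this toy setting, a single replaced token is itself inconsistent and can also disturb up to `WINDOW` following tokens, mirroring the red tokens in Figure 2; tokens further along are unaffected and would remain usable as evidence for generated-text detection.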