DualGuard: Dual-stream Large Language Model Watermarking Defense against Paraphrase and Spoofing Attack

Hao Li, Yubing Ren, Yanan Cao, Yingjie Li, Fang Fang, Shi Wang, Li Guo

TL;DR

DualGuard introduces an adaptive dual-stream watermarking framework to defend LLM outputs against both paraphrase and piggyback spoofing attacks, enabling reliable detection and tracing of malicious content. By employing a shared embedding model with standard and adversarial watermark heads and content-sensitive injection/detection, it achieves strong robustness while preserving text quality. The approach demonstrates high watermark detectability and spoofing traceability across multiple datasets and language models, with favorable performance against diverse attack models and modest computational overhead. This work advances trustworthy watermark deployment for real-world LLM usage by addressing spoofing vulnerabilities that previous schemes neglected.

Abstract

With the rapid development of cloud-based services, large language models (LLMs) have become increasingly accessible through various web platforms. However, this accessibility has also led to growing risks of model abuse. LLM watermarking has emerged as an effective approach to mitigate such misuse and protect intellectual property. Existing watermarking algorithms, however, primarily focus on defending against paraphrase attacks while overlooking piggyback spoofing attacks, which can inject harmful content, compromise watermark reliability, and undermine trust in attribution. To address this limitation, we propose DualGuard, the first watermarking algorithm capable of defending against both paraphrase and spoofing attacks. DualGuard employs an adaptive dual-stream watermarking mechanism, in which two complementary watermark signals are dynamically injected based on the semantic content. This design enables DualGuard not only to detect but also to trace spoofing attacks, thereby ensuring reliable and trustworthy watermark detection. Extensive experiments conducted across multiple datasets and language models demonstrate that DualGuard achieves excellent detectability, robustness, traceability, and text quality, effectively advancing the state of LLM watermarking for real-world applications.
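
To make the dual-stream idea concrete, here is a minimal toy sketch. It is not DualGuard's implementation: the summary above describes a shared embedding model with standard and adversarial watermark heads, whereas this sketch swaps in a simple KGW-style hashed green list with two secret keys. The names `green_list`, `generate`, `z_score`, the keys `"standard"` and `"adversarial"`, the vocabulary size, and the step-based stream selector are all assumptions made for illustration; a real system would choose the stream from the semantic content.

```python
import hashlib
import math
import random
from typing import Callable, List, Set

VOCAB_SIZE = 1000  # toy vocabulary of integer token ids (assumption)
GAMMA = 0.5        # fraction of the vocabulary marked "green" at each step


def green_list(prev_token: int, key: str) -> Set[int]:
    """Pseudo-random green list seeded by the previous token and a secret key."""
    digest = hashlib.sha256(f"{key}:{prev_token}".encode()).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return set(rng.sample(range(VOCAB_SIZE), int(GAMMA * VOCAB_SIZE)))


def generate(length: int, key_for_step: Callable[[int], str], seed: int = 0) -> List[int]:
    """Toy 'hard' watermark: every new token is drawn from the active green list.

    key_for_step is a hypothetical stand-in for content-based stream selection.
    """
    rng = random.Random(seed)
    tokens = [rng.randrange(VOCAB_SIZE)]
    for step in range(length):
        greens = sorted(green_list(tokens[-1], key_for_step(step)))
        tokens.append(rng.choice(greens))
    return tokens


def z_score(tokens: List[int], key: str) -> float:
    """One-proportion z-test on the green-token count under the given key."""
    n = len(tokens) - 1
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], key) for i in range(1, len(tokens))
    )
    return (hits - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


# Single-stream text: only the standard key is ever used.
single = generate(200, lambda step: "standard")

# Dual-stream text: the selector switches to the adversarial key halfway through.
dual = generate(200, lambda step: "standard" if step < 100 else "adversarial")

for name, text in [("single", single), ("dual", dual)]:
    print(name, round(z_score(text, "standard"), 2), round(z_score(text, "adversarial"), 2))
```

In this toy setting, text produced with only the standard key scores high under that key and near zero under the other, while the mixed text carries evidence of both keys. Scoring the two streams separately is what would let a detector reason about which parts of a text came from which stream; how DualGuard actually uses the two signals to trace spoofing is specified in the paper, not here.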

Paper Structure

This paper contains 35 sections, 10 equations, 10 figures, 7 tables, 2 algorithms.

Figures (10)

  • Figure 1: An example generated using the Llama-3.1-8B-Instruct model with KGW watermarking (Kirchenbauer et al., 2023), where the watermark mistakenly attributes malicious content (highlighted in red) injected by the spoofing attack to the LLM.
  • Figure 2: Overall framework of our watermarking method DualGuard. Gray indicates un-watermarked tokens, while blue and orange denote tokens watermarked by the standard and adversarial watermark heads, respectively.
  • Figure 3: Experimental results of different attack models on the RealNewsLike and RealToxicityPrompts datasets.
  • Figure 4: The impact of Dual-stream Selection.
  • Figure 5: PPL on RealNewsLike and BookSum datasets.
  • ...and 5 more figures