Lost in Overlap: Exploring Logit-based Watermark Collision in LLMs

Yiyang Luo, Ke Lin, Chao Gu, Jiahui Hou, Lijie Wen, Ping Luo

TL;DR

This paper introduces watermark collision as a general attack philosophy against logit-based LLM watermarks. It formalizes how sequentially applied watermarks induce an entangled text distribution $P_{\text{entangled}} = f(\boldsymbol{P}_{w^{(i)}})$ that weakens detectors relying on individual watermark distributions. Through a pipeline combining a Watermarker, Paraphrase/Back-translation/Mask-and-fill Colliders, and Detectors, the study shows collisions can strengthen attacks without notably harming text quality, and multi-round collisions further degrade detection. The findings reveal fundamental vulnerabilities in current watermarking schemes with implications for API tracing, detection, and the design of more robust watermarking mechanisms.
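The Watermarker, Collider, and Detector roles can be illustrated with a toy simulation. The sketch below is not the paper's implementation: it assumes a red-green list scheme in which each key pseudo-randomly marks a fraction `GAMMA` of the vocabulary as green given the previous token, adds a bias `DELTA` to green-token logits, and detects with a z-score on the green-token count. The "language model" is a uniform distribution over a small vocabulary, and the collider stands in for a paraphraser by resampling a fraction of tokens under its own key. All names and parameter values (`watermark`, `collide`, `z_score`, `rewrite_rate`, the keys `"W"` and `"C"`) are hypothetical.

```python
import hashlib
import math
import random

VOCAB_SIZE = 200   # toy vocabulary size (assumption)
GAMMA = 0.5        # fraction of the vocabulary on the green list (assumption)
DELTA = 2.0        # logit bias added to green tokens (assumption)
START = 0          # fixed start-of-text token used as the first context


def is_green(key, prev_token, token):
    """Key- and context-dependent green-list membership via a hash."""
    digest = hashlib.sha256(f"{key}:{prev_token}:{token}".encode()).digest()
    return digest[0] < 256 * GAMMA


def sample_token(rng, key, prev_token):
    """Sample from a uniform toy model whose green logits are raised by DELTA."""
    weights = [math.exp(DELTA) if is_green(key, prev_token, t) else 1.0
               for t in range(VOCAB_SIZE)]
    return rng.choices(range(VOCAB_SIZE), weights=weights, k=1)[0]


def watermark(rng, key, length):
    """Watermarker: generate `length` tokens biased toward the key's green list."""
    tokens, prev = [], START
    for _ in range(length):
        prev = sample_token(rng, key, prev)
        tokens.append(prev)
    return tokens


def collide(rng, tokens, key, rewrite_rate):
    """Collider: resample a fraction of tokens under a second watermark key."""
    out, prev = [], START
    for tok in tokens:
        if rng.random() < rewrite_rate:
            tok = sample_token(rng, key, prev)
        out.append(tok)
        prev = tok
    return out


def z_score(key, tokens):
    """Detector: standardized count of green tokens under the given key."""
    prevs = [START] + tokens[:-1]
    green = sum(is_green(key, p, t) for p, t in zip(prevs, tokens))
    n = len(tokens)
    return (green - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))


rng = random.Random(0)
text = watermark(rng, "W", 200)
print("round 0: z_W =", round(z_score("W", text), 2))

# Multi-round collision: each round applies a collider with a different key.
for i, key in enumerate(["C1", "C2", "C3"], start=1):
    text = collide(rng, text, key, rewrite_rate=0.7)
    print(f"round {i}: z_W =", round(z_score("W", text), 2),
          f"z_{key} =", round(z_score(key, text), 2))
```

In this toy setting the detector for `W` loses signal for two reasons: rewritten tokens are no longer biased toward the green list of `W`, and each rewritten token also changes the context that determines whether the following token counts as green. A real paraphraser changes text in far less uniform ways, so the numbers printed here only illustrate the mechanism and say nothing about the magnitudes reported in the paper.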

Abstract

The proliferation of large language models (LLMs) in generating content raises concerns about text copyright. Watermarking methods, particularly logit-based approaches, embed imperceptible identifiers into text to address these challenges. However, the widespread usage of watermarking across diverse LLMs has led to an inevitable issue known as watermark collision during common tasks, such as paraphrasing or translation. In this paper, we introduce watermark collision as a novel and general philosophy for watermark attacks, aimed at enhancing attack performance on top of any other attacking methods. We also provide a comprehensive demonstration that watermark collision poses a threat to all logit-based watermark algorithms, impacting not only specific attack scenarios but also downstream applications.

Paper Structure (38 sections, 3 equations, 7 figures, 12 tables)

Figures (7)

  • Figure 1: Illustration of watermark collisions.
  • Figure 2: The collision pipeline. $T_W$ denotes text carrying the first watermark $W$, and $T_C$ denotes text carrying dual watermarks after passing through a different collider $C\in\{P,R,M\}$. Unwatermarked texts generated by $W$ and $C$ are denoted $T_{W'}$ and $T_{C'}$, respectively. $T_C$ and $T_{C'}$ are then examined by detectors $D_W$ and $D_C$ to determine the presence of watermarks $W$ and $C$. Texts in red and green are visualization samples of the red-green list showing the original watermark $W$.
  • Figure 3: Multi-round TPR of paraphrased text under a series of paraphrase attacks by the same type of paraphraser with different watermarks. $\varnothing$ represents the original detection TPR before paraphrasing. A sequence of paraphrasers $(P^{(0)},P^{(1)},P^{(2)},\dotsc)$ is applied consecutively to the generated text from the preceding paraphraser.
  • Figure 4: The paraphrase prompt template for LLaMA-2 paraphraser.
  • Figure 5: The paraphrase prompt template for Qwen2 paraphraser.
  • ...and 2 more figures