Watermarking Needs Input Repetition Masking

David Khachaturov, Robert Mullins, Ilia Shumailov, Sumanth Dathathri

TL;DR

This work investigates watermark mimicry in LLM interactions, showing that linguistic adaptation by both humans and LLMs can propagate watermark-like signals beyond watermarked prompts. It evaluates two prominent n-gram–based watermark schemes across LLM–LLM and human–LLM conversations, revealing measurable mimicry especially in smaller models and under certain prompts, while highlighting that input repetition masking can suppress watermark signals. The study also employs a third-party detector to assess how human language may be misclassified as machine-generated as dialogues lengthen, underscoring practical limits of current watermarking. Collectively, the findings stress the need for watermarking schemes with lower false positives and point toward alternative watermarking dimensions (e.g., semantic or stylistic cues) to ensure robust long-term provenance.

Abstract

Recent advancements in Large Language Models (LLMs) have raised concerns over potential misuse, such as spreading misinformation. In response, two countermeasures have emerged: machine learning-based detectors that predict whether text is synthetic, and LLM watermarking, which subtly marks generated text for identification and attribution. Meanwhile, humans are known to adjust their language to their conversational partners, both syntactically and lexically. By implication, humans or unwatermarked LLMs could unintentionally mimic properties of LLM-generated text, making these countermeasures unreliable. In this work we investigate the extent to which such conversational adaptation happens. We call the concept $\textit{mimicry}$ and demonstrate that both humans and LLMs end up mimicking, including reproducing the watermarking signal, even in seemingly improbable settings. This challenges current academic assumptions and suggests that, for long-term watermarking to be reliable, the likelihood of false positives needs to be significantly lower, and longer word sequences should be used for seeding watermarking mechanisms.
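To make the idea of mimicry and input repetition masking concrete, here is a minimal sketch of a toy n-gram-seeded "green list" detector. It is not the paper's implementation or either of the evaluated schemes: the SHA-256 hashing, the green fraction `GAMMA`, and the helper names `is_green`, `z_score`, and `toy_watermarked_text` are all assumptions made for illustration. It shows how a reply that merely repeats part of a watermarked prompt inherits its signal, and how skipping n-grams already seen in the prompt removes that inherited evidence.

```python
import hashlib
import math

GAMMA = 0.5  # assumed fraction of tokens treated as "green" (illustrative)


def is_green(context, token, gamma=GAMMA):
    """Pseudo-randomly mark `token` green or red, seeded by the preceding context."""
    key = "\x1f".join(tuple(context) + (token,)).encode("utf-8")
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def scored_ngrams(tokens, ngram):
    """Yield (n-gram, green flag) for every position that has a full context."""
    for i in range(ngram - 1, len(tokens)):
        gram = tuple(tokens[i - ngram + 1 : i + 1])
        yield gram, is_green(gram[:-1], gram[-1])


def z_score(response, prompt=(), ngram=2, mask_repeats=False):
    """Return (z, n): the green-count z-score and the number of scored n-grams.

    With mask_repeats=True, n-grams that also occur in the prompt are skipped,
    so text copied from the prompt contributes no evidence.
    """
    seen = set()
    if mask_repeats:
        seen = {gram for gram, _ in scored_ngrams(list(prompt), ngram)}
    flags = [g for gram, g in scored_ngrams(list(response), ngram) if gram not in seen]
    n = len(flags)
    if n == 0:
        return 0.0, 0
    z = (sum(flags) - GAMMA * n) / math.sqrt(n * GAMMA * (1 - GAMMA))
    return z, n


def toy_watermarked_text(vocab, length, ngram=2):
    """Build toy 'watermarked' text by always picking a green next token."""
    tokens = [vocab[0]] * (ngram - 1)
    while len(tokens) < length:
        context = tuple(tokens[len(tokens) - (ngram - 1):])
        # Fall back to vocab[0] in the (very unlikely) case nothing is green.
        tokens.append(next((t for t in vocab if is_green(context, t)), vocab[0]))
    return tokens


if __name__ == "__main__":
    vocab = [f"w{i}" for i in range(50)]
    prompt = toy_watermarked_text(vocab, 40)
    # An "unwatermarked" reply that quotes the first half of the prompt.
    reply = prompt[:20] + [f"x{i}" for i in range(10)]
    print("prompt          ", z_score(prompt))
    print("reply, unmasked ", z_score(reply, prompt))
    print("reply, masked   ", z_score(reply, prompt, mask_repeats=True))
```

In this toy setting, the unmasked score of the reply is driven by the quoted span, while the masked score rests only on the n-grams the reply contributed itself. A larger `ngram` makes accidental overlap with the prompt less likely, which is in the spirit of the abstract's suggestion to seed on longer word sequences.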

Paper Structure

This paper contains 14 sections, 13 figures, 2 tables.

Figures (13)

  • Figure 1: An intuitive description of watermark mimicry. Here, a watermarked prompt is used together with an unwatermarked model. During a conversation, parts of the original watermarked prompt (green) are reused by the model, leading to watermark mimicry (red), so that an unwatermarked model outputs a watermarked response. Importantly, the watermark can even be stronger in the response, since the model can by coincidence produce a watermark in areas unaffected by mimicry (yellow).
  • Figure 2: The aaronson2022my scheme with varying n-gram size. Blue shows the percentage of watermarked prompts; orange shows the percentage of watermarked responses; green shows the percentage of watermarked responses where the response watermark is stronger than in the prompt; red shows the percentage of cases with both prompt and response watermarked.
  • Figure 3: Human–LLM dialogues (split in \ref{fig:sharegpt_wildchat}), filtered to contain long conversations in English. 520 are from ShareGPT, filtered for 100+ turns (100 human, 100 LLM). 446 are from the WildChat dataset, filtered for 50+ turns (100 human, 100 LLM).
  • Figure 4: Aaronson watermarking with Guanaco-7b at varying temperatures.
  • Figure 5: Aaronson watermarking with Guanaco-13b at varying temperatures.
  • ...and 8 more figures