Defending LLM Watermarking Against Spoofing Attacks with Contrastive Representation Learning

Li An, Yujian Liu, Yepeng Liu, Yang Zhang, Yuheng Bu, Shiyu Chang

TL;DR

This work tackles spoofing risks in LLM watermarking with a semantic-aware, post-hoc watermarking framework. A contrastively trained mapping f_θ produces a global green-red token split from the entire target text. Watermarks are embedded by perturbing logits toward green tokens during generation, and detection relies on the token distribution of the produced text; the design aims to react to meaning-altering edits while tolerating meaning-preserving ones. Empirical results on realnewslike (C4) and LFQA show strong robustness to paraphrasing and resilience against sentiment reversal and hate-speech insertion attacks, with competitive text quality and high detectability across backbones such as Llama-3.1-8B-Instruct and Qwen2.5-7B-Instruct. Code is released at https://github.com/UCSB-NLP-Chang/contrastive-watermark.

Abstract

Watermarking has emerged as a promising technique for detecting texts generated by LLMs. Current research has primarily focused on three design criteria: high quality of the watermarked text, high detectability, and robustness against removal attacks. However, security against spoofing attacks remains relatively understudied. For example, a piggyback attack can maliciously alter the meaning of watermarked text (transforming it into hate speech) while preserving the original watermark, thereby damaging the reputation of the LLM provider. We identify two core challenges that make defending against spoofing difficult: (1) the need for watermarks to be both sensitive to semantic-distorting changes and insensitive to semantic-preserving edits, and (2) the contradiction between the need to detect global semantic shifts and the local, auto-regressive nature of most watermarking schemes. To address these challenges, we propose a semantic-aware watermarking algorithm that post-hoc embeds watermarks into a given target text while preserving its original meaning. Our method introduces a semantic mapping model that guides the generation of a green-red token list and is contrastively trained to be sensitive to semantic-distorting changes and insensitive to semantic-preserving changes. Experiments on two standard benchmarks demonstrate strong robustness against removal attacks and security against spoofing attacks, including sentiment reversal and toxic content insertion, while maintaining high watermark detectability. Our approach offers a significant step toward more secure and semantically aware watermarking for LLMs. Our code is available at https://github.com/UCSB-NLP-Chang/contrastive-watermark.
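
The contrastive objective described in the abstract can be illustrated with a generic triplet loss. This is a minimal sketch rather than the authors' training code: the Euclidean distance, the margin of 1.0, and the 2-D toy embeddings are all assumptions chosen for illustration, and the function names are hypothetical. The idea is that the embedding of a semantic-preserving edit (positive) should sit closer to the original (anchor) than the embedding of a semantic-distorting edit (negative).

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge-style triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The margin value is an assumption, not taken from the paper.
    """
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)

# Hypothetical 2-D embeddings standing in for mapping-model outputs.
original = [0.0, 0.0]    # anchor: the target text
paraphrase = [0.1, 0.0]  # semantic-preserving edit, should stay close
spoofed = [3.0, 4.0]     # semantic-distorting edit, should be pushed away

# Well-separated case: the distorted text is already far away, so no penalty.
loss_separated = triplet_loss(original, paraphrase, spoofed)

# Swapped roles: treating the distorted text as the "positive" is penalized.
loss_swapped = triplet_loss(original, spoofed, paraphrase)
```

Minimizing such a loss over many (original, paraphrase, distorted) triplets is one way to obtain a mapping that tolerates paraphrasing but reacts to meaning changes, which is the property the abstract asks of the semantic mapping model.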

Paper Structure

This paper contains 27 sections, 1 equation, 14 figures, 7 tables.

Figures (14)

  • Figure 1: Performance on several key dimensions (higher is better) for existing approaches and our method; details are provided in the experimental settings section.
  • Figure 2: Overview of our method. Left: The semantic-aware watermarking framework. An LLM is prompted to paraphrase a given target text. During auto-regressive generation, the entire target text is fed to a semantic mapping model to construct a green-red token split. The LLM's output distribution is then perturbed by scaling up the probability of tokens assigned to the green list. Upper right: Data transformation and mapping process. Lower right: The triplet loss used to train the semantic mapping model.
  • Figure 3: Performance of LLM-as-judge on a scale of 1 to 3 (higher is better).
  • Figure 4: Performance trade-off with varying watermarking strength.
  • Figure 5: Impacts of input context for the mapping model.
  • ...and 9 more figures
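
The embedding step in the Figure 2 caption (scaling up the probability of green-list tokens) and detection from the token distribution of the output can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the green mask is supplied by hand instead of coming from the semantic mapping model, the bias strength `delta` and expected green fraction `gamma` are hypothetical parameters, and the one-proportion z-test is a detector commonly used with green-red list watermarks that may differ from the authors' detector.

```python
import math

def bias_logits(logits, green_mask, delta=2.0):
    """Add delta to the logit of every green token (delta is an assumed strength)."""
    return [x + delta if g else x for x, g in zip(logits, green_mask)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def green_z_score(token_ids, green_mask, gamma=0.5):
    """One-proportion z-test on the number of green tokens in a text.

    gamma is the assumed fraction of green tokens expected without a watermark.
    """
    n = len(token_ids)
    greens = sum(1 for t in token_ids if green_mask[t])
    return (greens - gamma * n) / math.sqrt(n * gamma * (1 - gamma))

# Toy 4-token vocabulary; in the paper the split would come from the mapping model.
vocab_green = [True, False, True, False]

# With uniform logits, biasing moves probability mass onto the green tokens.
probs_marked = softmax(bias_logits([0.0] * 4, vocab_green))
green_mass = sum(p for p, g in zip(probs_marked, vocab_green) if g)

# A text made only of green tokens scores high; a balanced text scores zero.
z_marked = green_z_score([0, 2, 0, 2, 0, 2, 0, 2], vocab_green)
z_balanced = green_z_score([0, 1, 2, 3, 0, 1, 2, 3], vocab_green)
```

Because the split is global (one mask for the whole text, per the caption) and does not depend on the preceding tokens, the same mask can be recomputed at detection time from the text under test; how the paper does this in detail is not shown here.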