
REMARK-LLM: A Robust and Efficient Watermarking Framework for Generative Large Language Models

Ruisi Zhang, Shehzeen Samarah Hussain, Paarth Neekhara, Farinaz Koushanfar

TL;DR

REMARK-LLM presents a robust, efficient framework to watermark LLM-generated text by embedding binary signatures into the model's output distribution via a learning-based message encoding, a differentiable reparameterization to a sparse token distribution, and a transformer-based decoding module for signature extraction. An optimized beam search during insertion preserves linguistic coherence while enabling reliable watermark retrieval, even under malicious transformations seen in real-world scenarios. End-to-end training minimizes semantic loss and maximizes watermark recoverability, while incorporating transformations to harden against removal and detection attacks. Across multiple datasets and unseen LLM architectures, REMARK-LLM embeds up to 2× more bits than prior art with strong semantic fidelity, transferability, and resilience, suggesting practical applicability for IP protection, anti-plagiarism, and misinformation tracing in real-world deployments.
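The TL;DR describes a differentiable reparameterization that maps the encoder's dense distribution onto a sparse token distribution. This page does not name the relaxation used. A common choice for this kind of dense-to-discrete mapping is the Gumbel-softmax trick, so the sketch below assumes it. The function names and toy logits are hypothetical, chosen only to show how a temperature pushes the output toward a one-hot vector while keeping it a smooth function of the logits.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax(logits, temperature, rng):
    """Relaxed one-hot sample: softmax((logits + Gumbel noise) / temperature).

    Assumption: Gumbel-softmax stands in for the paper's reparameterization.
    As the temperature drops, the output approaches a one-hot (sparse) vector,
    yet every entry remains a smooth function of the logits.
    """
    noisy = []
    for x in logits:
        # Clamp u away from 0 and 1 so both logs stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append(x - math.log(-math.log(u)))  # x + Gumbel(0, 1) noise
    return softmax(noisy, temperature)


# Hypothetical dense encoder scores for one position over a 4-token vocabulary.
dense_logits = [2.0, 1.5, 0.3, -1.0]
for tau in (5.0, 1.0, 0.1):
    rng = random.Random(0)  # same noise for every tau, so only sharpness changes
    probs = gumbel_softmax(dense_logits, tau, rng)
    print(tau, [round(p, 3) for p in probs])
```

Taking the argmax of the relaxed vector recovers a discrete token; training setups often pair this with a straight-through estimator so gradients still reach the encoder.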

Abstract

We present REMARK-LLM, a novel, efficient, and robust watermarking framework designed for texts generated by large language models (LLMs). Synthesizing human-like content with LLMs requires vast computational resources and extensive datasets, which encapsulate critical intellectual property (IP). However, the generated content is prone to malicious exploitation, including spamming and plagiarism. To address these challenges, REMARK-LLM proposes three new components: (i) a learning-based message encoding module that infuses binary signatures into LLM-generated texts; (ii) a reparameterization module that transforms the dense distributions from the message encoding into the sparse distribution of the watermarked textual tokens; and (iii) a decoding module dedicated to signature extraction. Furthermore, we introduce an optimized beam search algorithm to ensure the coherence and consistency of the generated content. REMARK-LLM is rigorously trained to encourage semantic preservation in watermarked content while ensuring effective watermark retrieval. Extensive evaluations on multiple unseen datasets highlight REMARK-LLM's proficiency and transferability: it inserts twice as many signature bits into the same texts as prior art, all while maintaining semantic integrity. REMARK-LLM also exhibits better resilience than prior art against a spectrum of watermark detection and removal attacks.
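The abstract says REMARK-LLM is trained to keep semantics intact while ensuring the signature can be retrieved. The sketch below illustrates such a two-term objective and how a decoded signature $M^\prime$ can be scored against the inserted message $M$. It assumes a binary cross-entropy recovery loss and treats the semantic loss as a precomputed scalar, because its exact form is not given on this page; `msg_weight` and all function names are hypothetical.

```python
import math


def message_loss(bits, probs, eps=1e-12):
    """Mean binary cross-entropy between inserted bits M and decoded probabilities M'.

    Assumption: BCE is used here as a stand-in for the message recovery loss.
    """
    if len(bits) != len(probs):
        raise ValueError("message length mismatch")
    total = 0.0
    for b, p in zip(bits, probs):
        p = min(max(p, eps), 1.0 - eps)  # keep both logs finite
        total += -(b * math.log(p) + (1 - b) * math.log(1.0 - p))
    return total / len(bits)


def total_loss(semantic_loss, bits, probs, msg_weight=1.0):
    """Hypothetical joint objective: semantic term plus weighted recovery term."""
    return semantic_loss + msg_weight * message_loss(bits, probs)


def bit_accuracy(bits, probs, threshold=0.5):
    """Fraction of signature bits recovered after thresholding the decoder output."""
    decoded = [1 if p >= threshold else 0 for p in probs]
    return sum(d == b for d, b in zip(decoded, bits)) / len(bits)


inserted = [1, 0, 1, 1, 0, 0, 1, 0]                    # hypothetical 8-bit signature M
predicted = [0.9, 0.2, 0.7, 0.4, 0.1, 0.3, 0.8, 0.6]   # hypothetical decoder output M'
print(bit_accuracy(inserted, predicted))  # 0.75
print(total_loss(0.35, inserted, predicted, msg_weight=2.0))
```

Weighting the recovery term lets training trade semantic fidelity against extraction reliability; the right balance would have to be tuned and is not specified here.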

Paper Structure

This paper contains 32 sections, 5 equations, 9 figures, 16 tables, and 1 algorithm.

Figures (9)

  • Figure 1: LLM-generated text watermarking scenario. The local user sends prompts to the remote LLM cloud API, and the API watermarks (WM) the response texts before returning them to the user. The LLM proprietor claims ownership by using the message decoding module to decode the signatures and compare them with the inserted watermarks.
  • Figure 2: REMARK-LLM's watermarking framework. Left: an overview of REMARK-LLM. The message encoding module leverages an optimized beam search algorithm to produce coherent watermarked content, and the message decoding module is designed for efficient watermark extraction. Right: REMARK-LLM's training pipeline. The message encoding, reparameterization, and message decoding modules are trained jointly end-to-end to minimize both the semantic loss between the original text $T$ and the watermarked distribution $S(T+M)$ and the message recovery loss between the inserted message $M$ and the predicted message $M^\prime$.
  • Figure 3: Comparison of watermarking strength and semantic preservation across watermarking frameworks. A z-score of 4, shown as the black dotted line, is the threshold for strong watermark insertion.
  • Figure 4: Watermarking performance under different attacks, with watermark extraction measured by AUC and semantic coherence measured by BERT-S. The attacks are performed on the ChatGPT Abstract dataset, with frameworks trained on the HC3 dataset. KGW [kirchenbauer2023watermark] is an inference-time watermarking framework; AWT [abdelnabi21oakland] is a neural-based watermarking framework. From left to right: text edit attacks (deletion, addition, and replacement), text rephrase attacks, and re-watermark attacks.
  • Figure 5: Word frequency distribution of original LLM-generated texts and corresponding watermarked texts.
  • ...and 4 more figures
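Figure 3 scores watermark strength with a z-score and treats $z \ge 4$ as strong. For KGW [kirchenbauer2023watermark], that score is the one-proportion test on green-list tokens, $z = (|s|_G - \gamma T)/\sqrt{T\gamma(1-\gamma)}$, where $T$ is the number of scored tokens, $|s|_G$ the number of green tokens, and $\gamma$ the green-list fraction. This page does not say how the scores for the other frameworks are derived; the sketch below assumes the same test can be applied to a binary signature by counting matching bits against the chance rate $\gamma = 0.5$. The function names are hypothetical.

```python
import math


def z_score(hits, n, gamma):
    """One-proportion z-statistic: how far `hits` out of `n` exceeds chance rate `gamma`."""
    if n <= 0 or not 0.0 < gamma < 1.0:
        raise ValueError("need n > 0 and 0 < gamma < 1")
    return (hits - gamma * n) / math.sqrt(n * gamma * (1.0 - gamma))


def is_strong_watermark(hits, n, gamma, threshold=4.0):
    """Apply the z = 4 cut-off from Figure 3."""
    return z_score(hits, n, gamma) >= threshold


# KGW-style: 100 green tokens out of 200 with a green-list fraction of 0.25.
print(round(z_score(100, 200, 0.25), 2))  # 8.16
# Assumed bit-matching variant: 48 of 64 signature bits match, chance rate 0.5.
print(round(z_score(48, 64, 0.5), 2))     # 4.0
```

Under the bit-matching assumption, a 64-bit signature needs at least 48 correct bits to reach the $z = 4$ line, which shows why embedding more bits makes a strong detection easier to reach at a given bit accuracy.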