PRO: Enabling Precise and Robust Text Watermark for Open-Source LLMs

Jiaqi Xue, Yifei Zhao, Mansour Al Ghanim, Shangqian Gao, Ruimin Sun, Qian Lou, Mengxin Zheng

TL;DR

This paper addresses the challenge of watermarking open-source LLMs, where the owner cannot control decoding and attackers can modify the weights. It introduces PRO, which uses CAWP to jointly train a learnable watermark policy with the LLM, producing detection-aligned watermark logits of the form $\delta\cdot M(E(x_{i-n:i-1}))$. Robustness comes from Forgotten Perturbation-aware Learning (FPL), which simulates perturbations during training to prevent watermark forgetting. The approach achieves high detectability (AUC approaching 0.99) with minimal degradation to text quality. It also shows strong resilience to real-world weight modifications such as quantization, pruning, merging, and fine-tuning, and outperforms prior open-source watermarking methods. These gains suggest that PRO is practically viable for provenance verification and IP protection in open-source LLM deployment, narrowing the gap with closed-source watermarking while keeping the watermark detectable end to end on generated text.
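
To make the logit form $\delta\cdot M(E(x_{i-n:i-1}))$ concrete, here is a minimal Python sketch. It rests on several assumptions that this summary does not confirm: $E$ is read as an embedding of the $n$ preceding tokens, $M$ as a small policy that outputs one score per vocabulary token, and $\delta$ as a scalar strength. The functions `make_weights`, `embed`, and `policy` are hypothetical stand-ins, not the paper's architecture, and the weights are random rather than trained.

```python
import math
import random

def make_weights(vocab_size, dim, seed=1):
    """Random stand-in for the trained parameters of the policy M (hypothetical)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(vocab_size)]

def embed(context, dim):
    """Hypothetical stand-in for E: mean of fixed pseudo-random token vectors."""
    vec = [0.0] * dim
    for tok in context:
        rng = random.Random(tok)  # same token id always yields the same vector
        for j in range(dim):
            vec[j] += rng.gauss(0.0, 1.0) / len(context)
    return vec

def policy(embedding, weights):
    """Hypothetical stand-in for M: linear layer + tanh, one score in [-1, 1] per token."""
    return [math.tanh(sum(w * e for w, e in zip(row, embedding))) for row in weights]

def watermark_logits(tokens, i, n, delta, weights, dim):
    """Return delta * M(E(x_{i-n:i-1})) for position i (requires i >= n)."""
    context = tokens[i - n:i]  # the n tokens preceding position i
    return [delta * s for s in policy(embed(context, dim), weights)]

# Usage: compute the watermark term for position 4 and add it to placeholder logits.
vocab_size, dim = 8, 4
weights = make_weights(vocab_size, dim)
tokens = [3, 1, 4, 1, 5]
bias = watermark_logits(tokens, i=4, n=2, delta=2.0, weights=weights, dim=dim)
llm_logits = [0.0] * vocab_size  # placeholder for real next-token logits
biased = [l + b for l, b in zip(llm_logits, bias)]
```

The term depends only on the preceding $n$ tokens, so a detector that knows $M$ and $E$ can recompute it from the text alone. The addition to placeholder logits is only there to show the shape of the term; how PRO uses it during joint training is not specified in this summary.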

Abstract

Text watermarking for large language models (LLMs) enables model owners to verify text origin and protect intellectual property. While watermarking methods for closed-source LLMs are relatively mature, extending them to open-source models remains challenging, as developers cannot control the decoding process. Consequently, owners of open-source LLMs lack practical means to verify whether text was generated by their models. A core difficulty lies in embedding watermarks directly into model weights without hurting detectability. A promising idea is to distill watermarks from a closed-source model into an open one, but this suffers from (i) poor detectability due to mismatch between learned and predefined patterns, and (ii) fragility to downstream modifications such as fine-tuning or model merging. To overcome these limitations, we propose PRO, a Precise and Robust text watermarking method for open-source LLMs. PRO jointly trains a watermark policy model with the LLM, producing patterns that are easier for the model to learn and more consistent with detection criteria. A regularization term further simulates downstream perturbations and penalizes degradation in watermark detectability, ensuring robustness under model edits. Experiments on open-source LLMs (e.g., LLaMA-3.2, LLaMA-3, Phi-2) show that PRO substantially improves both watermark detectability and resilience to model modifications.

Paper Structure

This paper contains 50 sections, 24 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Text watermarking for (a) closed-source and (b) open-source LLMs. Closed-source watermarking relies on watermark decoding, while open-source watermarking requires embedding the watermark into the model weights so that standard decoding still produces watermarked text.
  • Figure 2: (Left) learning-based and decoding-based KGW watermarks under varying watermark hyperparameters $n$ and $\delta$. (Right) existing learning-based watermarks' AUC after merging with the unwatermarked LLM.
  • Figure 3: Overview of our proposed method. CAWP (upper) jointly trains a watermark model with the student LLM to generate learning-friendly watermark logits. FPL (bottom) improves robustness by searching and minimizing the effect of forgotten perturbations that may erase the watermark.
  • Figure 4: Green token ratios for unwatermarked, watermark student, and watermark teacher (KGW with $\text{green ratio}=0.25$, $n=1$, $\delta=1$).
  • Figure 5: AUC during fine-tuning on raw, watermarked, and anti-watermarked texts.
  • ...and 4 more figures
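
The captions for Figures 2 and 4 refer to the KGW watermark, with a green ratio of 0.25, $n=1$, and $\delta=1$ in Figure 4. As a rough illustration of what those hyperparameters control, the sketch below implements a simplified KGW-style scheme: the previous token seeds a pseudo-random split of the vocabulary, green tokens receive a logit bonus of $\delta$, and detection computes a z-score on the observed number of green tokens. The seeding rule and the `key` value are assumptions chosen for brevity, not the hashing used by any particular implementation.

```python
import math
import random

def green_list(prev_token, vocab_size, gamma, key=42):
    """Pseudo-randomly pick a fraction gamma of the vocabulary, seeded by the previous token (n = 1)."""
    rng = random.Random(key * 1_000_003 + prev_token)  # illustrative seeding rule
    return set(rng.sample(range(vocab_size), int(gamma * vocab_size)))

def add_watermark(logits, prev_token, gamma, delta, key=42):
    """Decoding-based watermark: add delta to the logits of green tokens."""
    green = green_list(prev_token, len(logits), gamma, key)
    return [l + delta if t in green else l for t, l in enumerate(logits)]

def z_score(tokens, vocab_size, gamma, key=42):
    """One-proportion z-test: is the green-token count higher than gamma predicts?"""
    scored = len(tokens) - 1  # the first token has no predecessor
    hits = sum(
        tokens[i] in green_list(tokens[i - 1], vocab_size, gamma, key)
        for i in range(1, len(tokens))
    )
    return (hits - gamma * scored) / math.sqrt(scored * gamma * (1 - gamma))

# Usage: a sequence built only from green tokens yields a large z-score.
vocab_size, gamma = 100, 0.25
text = [0]
for _ in range(20):
    text.append(min(green_list(text[-1], vocab_size, gamma)))
score = z_score(text, vocab_size, gamma)
```

Unwatermarked text is expected to land near a green-token fraction of `gamma`, giving a z-score near zero, while watermarked text shifts that fraction upward. This fraction is the quantity that Figure 4 compares across the unwatermarked model, the student, and the teacher.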