
Learning to Watermark LLM-generated Text via Reinforcement Learning

Xiaojun Xu, Yuanshun Yao, Yang Liu

TL;DR

This work designs a model-level watermark that embeds signals into the LLM weights so that a paired detector can detect them. It proposes a co-training framework based on reinforcement learning that iteratively trains the detector to recognize the generated watermarked text and tunes the LLM to generate text that the detector can easily detect.

Abstract

We study how to watermark LLM outputs, i.e., embedding algorithmically detectable signals into LLM-generated text to track misuse. Unlike the current mainstream methods that work with a fixed LLM, we expand the watermark design space by including the LLM tuning stage in the watermark pipeline. While prior works focus on token-level watermarks that embed signals into the output, we design a model-level watermark that embeds signals into the LLM weights, and such signals can be detected by a paired detector. We propose a co-training framework based on reinforcement learning that iteratively (1) trains a detector to detect the generated watermarked text and (2) tunes the LLM to generate text easily detectable by the detector while keeping its normal utility. We empirically show that our watermarks are more accurate, robust, and adaptable (to new attacks). Our approach also allows the watermarked model to be open-sourced. In addition, if used together with alignment, the extra overhead introduced is low: only an extra reward model (i.e., our detector) needs to be trained. We hope our work can bring more effort into studying a broader watermark design that is not limited to working with a fixed LLM. We open-source the code: https://github.com/xiaojunxu/learning-to-watermark-llm
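
The alternating loop in steps (1) and (2) can be illustrated with a toy sketch. This is not the authors' implementation (the linked repository holds that); it only shows the shape of the co-training under strong simplifying assumptions. The "LLM" is a single categorical distribution over a tiny vocabulary, the detector is a logistic regression on token frequencies, and the LLM step is a plain REINFORCE update with a KL penalty toward a frozen reference distribution, standing in for "keeping its normal utility". All names and hyperparameters (`co_train`, `kl_coef`, `VOCAB`, and so on) are hypothetical.

```python
import math
import random

VOCAB = 8      # toy vocabulary size (assumption)
SEQ_LEN = 20   # tokens per generated "text" (assumption)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sample_text(probs, rng):
    return rng.choices(range(VOCAB), weights=probs, k=SEQ_LEN)

def features(text):
    # Token-frequency vector; entries sum to 1.
    freq = [0.0] * VOCAB
    for tok in text:
        freq[tok] += 1.0 / SEQ_LEN
    return freq

def detect(w, b, text):
    # Detector score in [0, 1]: estimated probability that text is watermarked.
    return sigmoid(b + sum(wi * xi for wi, xi in zip(w, features(text))))

def co_train(rounds=200, batch=32, lr_det=2.0, lr_llm=0.5, kl_coef=0.5, seed=0):
    rng = random.Random(seed)
    ref = [1.0 / VOCAB] * VOCAB   # frozen reference (non-watermarked) distribution
    logits = [0.0] * VOCAB        # toy stand-in for the tunable LLM weights
    w, b = [0.0] * VOCAB, 0.0     # detector parameters
    for _ in range(rounds):
        probs = softmax(logits)
        marked = [sample_text(probs, rng) for _ in range(batch)]
        clean = [sample_text(ref, rng) for _ in range(batch)]
        # (1) Detector step: SGD on the logistic loss, marked=1 vs clean=0.
        for text, label in [(t, 1.0) for t in marked] + [(t, 0.0) for t in clean]:
            err = detect(w, b, text) - label
            x = features(text)
            for i in range(VOCAB):
                w[i] -= lr_det * err * x[i] / (2 * batch)
            b -= lr_det * err / (2 * batch)
        # (2) LLM step: REINFORCE with the detector score as the reward.
        rewards = [detect(w, b, t) for t in marked]
        baseline = sum(rewards) / batch
        grad = [0.0] * VOCAB
        for text, r in zip(marked, rewards):
            x = features(text)
            for i in range(VOCAB):
                # d log p(text) / d logit_i = SEQ_LEN * (freq_i - p_i)
                grad[i] += (r - baseline) * SEQ_LEN * (x[i] - probs[i]) / batch
        # Gradient of KL(p || ref) w.r.t. logit_i is p_i * (log(p_i / ref_i) - KL).
        log_ratio = [math.log(max(p, 1e-12) / q) for p, q in zip(probs, ref)]
        kl = sum(p * lr for p, lr in zip(probs, log_ratio))
        for i in range(VOCAB):
            kl_grad = probs[i] * (log_ratio[i] - kl)
            logits[i] += lr_llm * (grad[i] - kl_coef * kl_grad)
    return logits, w, b
```

Calling `logits, w, b = co_train()` returns the tuned toy "LLM" logits and the paired detector. Scoring fresh samples with `detect(w, b, sample_text(softmax(logits), rng))` against samples from the uniform reference gives a rough sense of how separable the two have become. In the setting the abstract describes, both components would be neural networks and the reward would come from a detector trained like an RLHF reward model.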

Paper Structure (31 sections, 3 equations, 4 figures, 9 tables, 1 algorithm)


Figures (4)

  • Figure 1: Overview of our framework compared to prior works. Left: Prior methods (Kirchenbauer et al., 2023; Kuditipudi et al., 2023) focus on working with a fixed model. They introduce distortions into the LLM output distribution, and these distortions serve as the detection signal. Right: Our approach injects the watermark into the LLM weights by finetuning. The watermark is propagated to the output and detected by a paired detector co-trained with the LLM in an RLHF framework, where a reward model can serve as the detector.
  • Figure 2: Detection performance of the watermarked text under word substitution attacks.
  • Figure 3: Detection performance of the watermarked text under paraphrasing attacks with Pegasus.
  • Figure 4: Detection performance of the watermarked text adversarially trained with Pegasus paraphrasing, tested with DIPPER paraphrasing.