Can you Finetune your Binoculars? Embedding Text Watermarks into the Weights of Large Language Models

Fay Elhassan, Niccolò Ajroldi, Antonio Orvieto, Jonas Geiping

TL;DR

The paper tackles the challenge of making AI-generated text auditable by embedding a watermark directly into the weights of open-weight LLMs. It introduces a two-LoRA adapter setup (performer and observer) trained end-to-end under the Binoculars detectability objective, with a constrained, regularized optimization to balance watermark strength and linguistic naturalness. By reformulating the objective with barrier-based constraints and evaluating on a fine-tuned LLaMA 3.1 8B model across diverse datasets, the approach achieves ROC-AUC around $0.968$ for watermark detectability while preserving task performance on instruction-following benchmarks. This work demonstrates a viable path toward model-level watermarking that does not rely on inference-time sampling changes, enhancing accountability for open-weight AI systems.

Abstract

The indistinguishability of AI-generated content from human text raises challenges in transparency and accountability. While several methods exist to watermark models behind APIs, embedding watermark strategies directly into model weights, so that they are later reflected in the outputs of the model, is challenging. In this study we propose a strategy to finetune a pair of low-rank adapters of a model, one serving as the text-generating model and the other as the detector, so that a subtle watermark is embedded into the text generated by the first model and simultaneously optimized for detectability by the second. In this way, the watermarking strategy is fully learned end-to-end. This process poses an optimization challenge, as balancing watermark robustness, naturalness, and task performance requires trade-offs. We discuss strategies for optimizing this min-max objective and present results showing the effect of this modification on instruction finetuning.
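To make the detectability objective concrete, here is a minimal NumPy sketch of a Binoculars-style score: the log-perplexity of a text under one model divided by the cross-perplexity between two models. This is an illustration under stated assumptions, not the authors' training code. The function name `binoculars_score`, the array shapes, and the choice of which model (performer or observer) appears in each term are assumptions made for this example; the paper's exact formulation, regularizer, and barrier constraints are not reproduced here.

```python
import numpy as np


def log_softmax(logits):
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def binoculars_score(performer_logits, observer_logits, token_ids):
    """Binoculars-style score for one sequence (illustrative sketch).

    performer_logits, observer_logits: arrays of shape (L, V) holding the
        next-token logits of each model at every position.
    token_ids: integer array of shape (L,) with the tokens actually observed.

    Assumption: the numerator uses the performer's log-perplexity and the
    denominator weights the performer's log-probabilities by the observer's
    distribution. The roles could be assigned differently.
    """
    logp_perf = log_softmax(performer_logits)
    logp_obs = log_softmax(observer_logits)

    # Log-perplexity: average negative log-likelihood of the observed tokens.
    positions = np.arange(len(token_ids))
    log_ppl = -logp_perf[positions, token_ids].mean()

    # Cross-perplexity: average cross-entropy between the two models'
    # next-token distributions.
    cross_ppl = -(np.exp(logp_obs) * logp_perf).sum(axis=-1).mean()

    return log_ppl / cross_ppl


# Toy usage with a vocabulary of 4 tokens and a sequence of 3 positions.
tokens = np.array([0, 2, 1])
uniform = np.zeros((3, 4))
peaked = np.zeros((3, 4))
peaked[np.arange(3), tokens] = 5.0  # model strongly expects the observed tokens

score_uniform = binoculars_score(uniform, uniform, tokens)
score_peaked = binoculars_score(peaked, peaked, tokens)
```

In the Binoculars detector, a lower score indicates text that the models find unsurprising relative to their mutual disagreement, which is treated as evidence of machine generation. Finetuning both adapters against such a score is what lets the watermark be learned end-to-end rather than imposed at sampling time.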

Paper Structure

This paper contains 30 sections, 14 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Training without regularizer ($\lambda = 10^{-2}$)
  • Figure 2: Exponential regularizer ($\lambda = 10^{-2}$)
  • Figure 4: ROC curve comparison across different conditions.
  • Figure 5: Precision-Recall curves comparing different experimental conditions.
  • Figure 7: Accuracy trends. Exponential regularization ($\lambda = 10^{-5}, 10^{-2}$) stabilizes training.
  • ...and 2 more figures