Towards Watermarking of Open-Source LLMs

Thibaud Gloaguen, Nikola Jovanović, Robin Staab, Martin Vechev

TL;DR

Generation-time watermarks designed for closed models do not carry over to open-source LLMs, because users control the decoding process. This work formalizes durability, meaning that the watermark survives common model modifications (quantization, pruning, merging, finetuning), as a core requirement for open-source model (OSM) watermarking, and proposes a systematic evaluation procedure for it. None of the existing OSM watermarks evaluated proves durable under these modifications. The study then explores distillation-based approaches: a GPT-2 proof of concept suggests that larger distillation datasets improve durability without fully solving the problem. The authors highlight the remaining challenges and propose directions such as pretraining-based distillation and task-aware strategies, with implications for accountability and provenance in open-weight LLM deployments.

Abstract

While watermarks for closed LLMs have matured and have been included in large-scale deployments, these methods are not applicable to open-source models, which allow users full control over the decoding process. This setting is understudied yet critical, given the rising performance of open-source models. In this work, we lay the foundation for systematic study of open-source LLM watermarking. For the first time, we explicitly formulate key requirements, including durability against common model modifications such as model merging, quantization, or finetuning, and propose a concrete evaluation setup. Given the prevalence of these modifications, durability is crucial for an open-source watermark to be effective. We survey and evaluate existing methods, showing that they are not durable. We also discuss potential ways to improve their durability and highlight remaining challenges. We hope our work enables future progress on this important problem.

Paper Structure

This paper contains 53 sections, 8 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Definition and evaluation of OSM watermark durability. ①: Given a base unwatermarked model, a watermark is embedded into its weights. ②: The model is uploaded to a model-sharing platform like Hugging Face. ③: The model is tested against the established requirements for generation-time watermarks. ④: Yet, third-party users modify the weights of the model through quantization, pruning, merging, and finetuning, and may distribute the modified model. We ask: are such modified models still watermarked? To evaluate this, we introduce a new requirement: durability, and propose a systematic evaluation procedure based on the most common model modifications.
  • Figure 2: Evaluation of the TPR difference between KGW-D (Pretrained) and KGW-D (Long) when finetuned (as a model modification) on either OpenWebText or OpenMathInstruct.
  • Figure 3: Evolution of different watermark TPRs against multiple quantization methods. Each color corresponds to a different quantization method. The rejection rate $\alpha$ is in logarithmic scale for clarity.
  • Figure 4: Evolution of different watermark TPRs for different SLERP interpolation levels $t$.
  • Figure 5: Evolution of different watermark TPRs averaged over three pruning techniques (Wanda, GBLM, and SparseGPT) at different sparsity ratios $\rho$. The rejection rate $\alpha$ is in logarithmic scale.
  • ...and 3 more figures
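
The SLERP interpolation level $t$ in the Figure 4 caption can be made concrete with a small sketch. The code below is a generic spherical linear interpolation between two weight tensors, not the paper's merging pipeline: the function name `slerp`, the per-tensor flattening, and the convention that $t = 0$ returns the first model and $t = 1$ the second are all assumptions made for illustration.

```python
import numpy as np

def slerp(w0, w1, t, eps=1e-8):
    """Spherical linear interpolation between two weight tensors.

    Hypothetical helper for illustration; assumes t=0 returns w0 and
    t=1 returns w1. Both tensors must have the same shape.
    """
    v0 = np.asarray(w0, dtype=np.float64).ravel()
    v1 = np.asarray(w1, dtype=np.float64).ravel()

    # Angle between the two flattened weight vectors.
    cos_omega = np.dot(v0, v1) / (np.linalg.norm(v0) * np.linalg.norm(v1) + eps)
    omega = np.arccos(np.clip(cos_omega, -1.0, 1.0))

    if np.sin(omega) < eps:
        # Nearly colinear vectors: fall back to plain linear interpolation.
        merged = (1.0 - t) * v0 + t * v1
    else:
        merged = (np.sin((1.0 - t) * omega) * v0
                  + np.sin(t * omega) * v1) / np.sin(omega)
    return merged.reshape(np.shape(w0))

# Toy usage: sweep t between a "watermarked" and an "unwatermarked" tensor.
watermarked = np.array([1.0, 0.0])
other = np.array([0.0, 1.0])
for t in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(t, slerp(watermarked, other, t))
```

In a durability sweep of this kind, one would apply the interpolation tensor by tensor to both checkpoints, then rerun watermark detection on text generated by each merged model to see how the TPR changes with $t$.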