MarkTune: Improving the Quality-Detectability Trade-off in Open-Weight LLM Watermarking

Yizhou Zhao, Zhiwei Steven Wu, Adam Block

TL;DR

MarkTune addresses the challenge of watermarking open-weight LLMs by moving from static weight perturbations (GaussMark) to an on-policy fine-tuning regime that treats the watermark signal as a learnable reward. By optimizing a dual objective that rewards watermark detectability while regularizing text quality, MarkTune steers updates toward watermark-sensitive directions near a high-quality reference distribution, achieving a quality-detectability frontier close to inference-time methods. The approach preserves false-positive guarantees, generalizes across datasets, and demonstrates robustness to paraphrasing and substantial fine-tuning. Empirically, it outperforms prior model-embedded schemes on detection strength with minimal degradation to downstream performance, suggesting a practical, general strategy for embedding robust watermarks in open-weight LMs.

Abstract

Watermarking aims to embed hidden signals in generated text that can be reliably detected when given access to a secret key. Open-weight language models pose acute challenges for such watermarking schemes because the inference-time interventions that dominate contemporary approaches cannot be enforced once model weights are public. Existing watermarking techniques for open-weight models, such as the recently proposed GaussMark, typically rely on small modifications to model weights, which can yield signals detectable to those equipped with a secret key, but achieving detection power comparable to inference-time watermarks generally requires weight perturbations that noticeably reduce generation quality. We introduce MarkTune, a theoretically principled, on-policy fine-tuning framework that treats the GaussMark signal as a reward while simultaneously regularizing against degradation in text quality. We derive MarkTune as an improvement on GaussMark and demonstrate that MarkTune consistently improves the quality-detectability trade-off over GaussMark by steering finer-grained, watermark-aware weight updates within the model's representation space while preserving generation quality. Empirically, we show that MarkTune pushes the quality-detectability frontier of GaussMark close to that of inference-time watermarking, remains robust to paraphrasing and fine-tuning attacks, and exhibits strong generalization: a model fine-tuned on one dataset retains substantial watermark detection power on unseen datasets. Together, these results establish MarkTune as a general strategy for embedding robust, high-quality watermarks into open-weight LMs.
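
To make the idea of a key-dependent detection signal concrete, here is a minimal sketch. It assumes the GaussMark-style statistic is a normalized inner product between a secret Gaussian key and a gradient of the text's log-likelihood. That form is not stated on this page and is an assumption here, and the vectors below are hypothetical stand-ins rather than gradients from a real model. The sketch illustrates one point only: if the key is isotropic Gaussian and independent of the text, the statistic is standard normal, which is what lets a threshold on it control the false-positive rate.

```python
import math
import random
from statistics import NormalDist


def gaussmark_style_statistic(key, grad, sigma):
    """Return <key, grad> / (sigma * ||grad||).

    ASSUMPTION: illustrative form only. `grad` stands in for the gradient of
    a text's log-likelihood with respect to the watermarked parameters.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    if norm == 0.0:
        return 0.0  # no direction to test against
    return sum(k * g for k, g in zip(key, grad)) / (sigma * norm)


def p_value(stat):
    """One-sided p-value under a standard normal null."""
    return 1.0 - NormalDist().cdf(stat)


rng = random.Random(0)
dim, sigma = 64, 0.05

# Secret key: i.i.d. Gaussian entries with standard deviation sigma.
key = [rng.gauss(0.0, sigma) for _ in range(dim)]

# Null case: a direction drawn independently of the key.
null_grad = [rng.gauss(0.0, 1.0) for _ in range(dim)]

# Aligned case: a hypothetical direction correlated with the key,
# standing in for text produced by a watermarked model.
aligned_grad = [k / sigma + 0.5 * rng.gauss(0.0, 1.0) for k in key]

null_stat = gaussmark_style_statistic(key, null_grad, sigma)
aligned_stat = gaussmark_style_statistic(key, aligned_grad, sigma)

print("null statistic:", null_stat, "p-value:", p_value(null_stat))
print("aligned statistic:", aligned_stat, "p-value:", p_value(aligned_stat))
```

In this toy setting, a fine-tuning scheme that uses such a statistic as a reward would push generated text toward the aligned case, while a separate quality regularizer would limit how far the model drifts from its reference distribution.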

Paper Structure

This paper contains 47 sections, 5 theorems, 37 equations, 5 figures, 7 tables, 2 algorithms.

Key Result

Proposition 1

Let $p_\theta$ be a language model with parameters $\theta = (\theta_{\mathrm{wm}}, \theta_0) \in \Theta$, where $\theta_{\mathrm{wm}} \in \mathbb{R}^{d_r}$ is the subset of parameters to which the GaussMark is applied. Assume that the map $\theta' \mapsto \mathrm{D_{KL}}\!\left(p_{\theta'}(\cdot \m

Figures (5)

  • Figure 1: Trade-off between detectability (TPR@1% FPR) and text quality (perplexity) across various watermarking schemes. Inference-time watermarking methods (KGW [2023KGW], Gumbel-max [2023Aaronson], SynthID [2024scalable]) modify only the sampling process and are shown here for reference, as they are not applicable in open-weight settings. Model-embedded watermarking methods (GaussMark [2025gaussmark] and our MarkTune) embed the watermark directly into the model weights. MarkTune substantially improves the trade-off over GaussMark and achieves performance comparable to inference-time watermarking methods. The black "×" marks the MarkTune configuration used in the experiments section.
  • Figure 2: Overview of our framework compared to prior work [2023KGW, 2023Aaronson, 2024scalable, 2025gaussmark]. Left: Inference-time watermarking schemes break down on open-weight LLMs because users can disable the decoding algorithm, and these methods often introduce substantial generation latency (icons in the figure distinguish methods with no added latency from those with extra latency). Right: Our approach, MarkTune, treats the GaussMark test statistic as a reward and performs on-policy fine-tuning to embed a highly detectable yet quality-preserving watermark signal into the model's weights.
  • Figure 3: Stylized one-dimensional landscape along a watermark-sensitive direction. Both $\theta_{\textsc{GaussMark}}$ and $\theta_{\textsc{MarkTune}}$ lie at nontrivial distances from the base model $\theta$ along this direction, leading to significant watermark detectability, but $\theta_{\textsc{MarkTune}}$ resides within the flat high-quality basin around $\theta^\star$ and therefore incurs substantially less quality degradation than $\theta_{\textsc{GaussMark}}$.
  • Figure 4: Relative downstream task accuracy compared to unwatermarked models across the general, math, and coding benchmarks.
  • Figure 5: Detectability (TPR@1% FPR) decay under a LoRA fine-tuning attack.
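
The captions above report detectability as TPR@1% FPR. As a minimal sketch of how such a number can be computed from detection scores, assuming higher scores indicate a watermark and that a sample of scores from unwatermarked text is available: the function name `tpr_at_fpr` and the toy score lists are hypothetical and do not come from the paper.

```python
import math


def tpr_at_fpr(null_scores, watermarked_scores, target_fpr=0.01):
    """Empirical true-positive rate at a target false-positive rate.

    The threshold is chosen on `null_scores` (unwatermarked text) so that
    at most a `target_fpr` fraction of them lies strictly above it.
    Returns (tpr, threshold).
    """
    if not null_scores or not watermarked_scores:
        raise ValueError("both score lists must be non-empty")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")

    ranked = sorted(null_scores, reverse=True)
    # Number of null scores allowed to exceed the threshold.
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    allowed = min(allowed, len(ranked) - 1)
    threshold = ranked[allowed]

    hits = sum(1 for s in watermarked_scores if s > threshold)
    return hits / len(watermarked_scores), threshold


# Toy data: 100 null scores and 4 scores from (hypothetical) watermarked text.
null = [float(i) for i in range(100)]
marked = [97.0, 98.0, 99.0, 100.0]

tpr, threshold = tpr_at_fpr(null, marked, target_fpr=0.01)
print(tpr, threshold)  # prints "0.5 98.0"
```

With 100 null scores and a 1% target, exactly one null score (99.0) may exceed the threshold, so the threshold is 98.0 and two of the four watermarked scores are detected.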

Theorems & Definitions (11)

  • Proposition 1
  • Remark 1
  • Proposition 2
  • proof
  • proof
  • Lemma 1: Closed-form optimizer in parameter space
  • proof
  • Lemma 2: Reward gradient in the linear-softmax model
  • proof
  • Proposition 3: Second-order CE cost and first-order reward gain
  • ...and 1 more