Watermarking Degrades Alignment in Language Models: Analysis and Mitigation

Apurv Verma, NhatHai Phan, Shubhendu Trivedi

TL;DR

This first empirical study of watermarking-alignment interactions shows that a simple inference-time fix can recover alignment. It also uses standard results on the expected maximum of Gaussian random variables to derive a theoretical lower bound on the alignment gain.

Abstract

Watermarking has become a practical tool for tracing language model outputs, but it modifies token probabilities at inference time, and those probabilities were carefully tuned by alignment training. This creates a tension: how do watermark-induced shifts interact with the procedures intended to make models safe and useful? Experiments on several contemporary models and two representative watermarking schemes reveal that watermarking induces a nontrivial shift in alignment that is patterned yet model-specific. We observe two failure modes: guard attenuation, where models become more helpful but less safe, and guard amplification, where refusals become overly conservative. These effects persist even after controlling for perplexity degradation, which points to alignment-specific distortions rather than mere quality loss. We address this with Alignment Resampling (AR), a procedure that samples multiple watermarked outputs and selects the most aligned response according to an external reward model. Using standard results on the expected maximum of Gaussian random variables, we derive a theoretical lower bound showing that alignment gains grow sublogarithmically with sample size. In practice, sampling as few as two to four candidates largely restores unwatermarked alignment performance in truthfulness, safety, and helpfulness, without hurting watermark detection. This is the first empirical study of watermarking-alignment interactions; it shows that a simple inference-time fix can recover alignment.
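
The selection step described in the abstract can be sketched as a best-of-n loop. The following is a minimal illustration under stated assumptions, not the authors' implementation: `generate_watermarked` and `reward` are hypothetical callables standing in for a watermarked sampler and an external reward model, and `mean_max_gaussian` is an illustrative helper that numerically estimates the expected maximum of n i.i.d. standard normal draws, on the assumption that reward scores are roughly Gaussian. The comparison value $\sqrt{2 \ln n}$ is the classical upper bound on that expectation, included only to show the slow growth; it is not the lower bound derived in the paper.

```python
import math
import random
from typing import Callable, Tuple


def alignment_resample(
    prompt: str,
    generate_watermarked: Callable[[str], str],  # hypothetical watermarked sampler
    reward: Callable[[str, str], float],         # hypothetical external reward model
    n: int = 4,
) -> Tuple[str, float]:
    """Draw n watermarked candidates and return the highest-reward one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [generate_watermarked(prompt) for _ in range(n)]
    scored = [(reward(prompt, c), c) for c in candidates]
    best_score, best = max(scored, key=lambda pair: pair[0])
    return best, best_score


def mean_max_gaussian(n: int, trials: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of E[max of n i.i.d. standard normal draws]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += max(rng.gauss(0.0, 1.0) for _ in range(n))
    return total / trials


if __name__ == "__main__":
    # Expected best-of-n score (in standard deviations above the mean)
    # next to the classical sqrt(2 ln n) upper bound.
    for n in (2, 4, 8, 16):
        print(n, round(mean_max_gaussian(n), 3), round(math.sqrt(2.0 * math.log(n)), 3))
```

Because every candidate in this sketch is produced by the watermarked sampler, the selected response still carries the watermark, which is consistent with the abstract's claim that detection is not harmed. The estimate also rises quickly for small n and flattens afterwards, which matches the observation that two to four candidates recover most of the gain.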

Paper Structure

This paper contains 65 sections, 14 equations, 41 figures, 5 tables, 1 algorithm.

Figures (41)

  • Figure 1: Watermarking degrades alignment across multiple dimensions, while Alignment Resampling restores it. We present qualitative examples across three scenarios: Truthfulness (left), Safety (middle), and Overrefusal (right) from the LLaMA-8B-Inst model, using the KGW watermark ($\delta=2$, $\gamma=0.25$). The Unwatermarked model (top, green) consistently produces aligned responses. The Watermarked model (middle, red) exhibits systematic degradation: it hallucinates factual details, complies with harmful requests (guard attenuation), or refuses benign queries (guard amplification). Our proposed Alignment Resampling (bottom, blue) successfully mitigates these shifts, recovering the original alignment properties. More examples are provided in Appendix \ref{appendix:more_examples}.
  • Figure 2: Watermarking reduces model truthfulness, but reward-guided sampling provides effective mitigation. Evaluations use TruthfulQA \cite{DBLP:conf/acl/LinHE22} at temperature $\tau=1.0$. Higher scores indicate greater truthfulness. The left panel demonstrates the problem; the right panel shows our solution.
  • Figure 3: Watermarking produces divergent safety effects across models. KGW watermarking amplifies unsafe behaviors in economic harm and malware domains, while Phi-3-Mini appears safer through increased conservatism rather than improved safety reasoning (see Appendices \ref{appendix:safety_prompt} and \ref{appendix:safety_dataset}).
  • Figure 4: Watermarking induces heterogeneous behavioral shifts across models. Left: Changes in unsafe response frequencies reveal model-specific patterns, with some models becoming less safe while others appear safer. Right: Overrefusal analysis exposes the true mechanism behind apparent safety improvements, showing dramatically increased conservative behavior in certain models.
  • Figure 5: Simplex visualization reveals watermarking's impact on alignment trade-offs. Each point represents a model's response distribution across three categories: safe responses, unsafe responses, and overrefusals. Left panel shows watermarking-induced disruptions; right panel demonstrates mitigation through reward-guided sampling.
  • ...and 36 more figures

Theorems & Definitions (1)

  • proof