CATMark: A Context-Aware Thresholding Framework for Robust Cross-Task Watermarking in Large Language Models

Yu Zhang, Shuliang Liu, Xu Yang, Xuming Hu

TL;DR

CATMark addresses the problem of watermarking LLM outputs across heterogeneous generation tasks with a context-aware framework that adapts its entropy threshold during generation. It clusters token generation contexts by their logit distributions, computes a per-context entropy threshold from historical data, and applies the watermark bias only to high-entropy tokens. The approach needs no pre-set thresholds and preserves fidelity in low-entropy content such as code. On code, math, and StackEval benchmarks it maintains detection performance (measured by AUROC) alongside task quality (measured by pass@k). It also withstands rewriting attacks such as back-translation and paraphrasing, with acceptable computational overhead.

Abstract

Watermarking algorithms for Large Language Models (LLMs) effectively identify machine-generated content by embedding and detecting hidden statistical features in text. However, such embedding leads to a decline in text quality, especially in low-entropy scenarios where performance needs improvement. Existing methods that rely on entropy thresholds often require significant computational resources for tuning and demonstrate poor adaptability to unknown or cross-task generation scenarios. We propose Context-Aware Threshold watermarking (CATMark), a novel framework that dynamically adjusts watermarking intensity based on real-time semantic context. CATMark partitions text generation into semantic states using logits clustering, establishing context-aware entropy thresholds that preserve fidelity in structured content while embedding robust watermarks. Crucially, it requires no pre-defined thresholds or task-specific tuning. Experiments show CATMark improves text quality in cross-task settings without sacrificing detection accuracy.
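
The selective embedding step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the use of the history mean as the threshold, the `min_history` warm-up rule, and the seeding of the green list from the previous token are all hypothetical. Clustering by logit similarity is omitted: `history` is assumed to already hold the spike entropies observed for the current token's cluster. Spike entropy is computed in its usual form, $S(p, z) = \sum_k p_k / (1 + z p_k)$.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def spike_entropy(probs, modulus=1.0):
    """Spike entropy: sum_k p_k / (1 + modulus * p_k)."""
    return sum(p / (1.0 + modulus * p) for p in probs)


def green_list(prev_token, vocab_size, gamma):
    """Pseudo-random green list seeded by the previous token (assumed scheme)."""
    rng = random.Random(prev_token)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def watermark_logits(logits, prev_token, history,
                     gamma=0.5, delta=2.0, min_history=1):
    """Add `delta` to green-list logits only if the token is high-entropy.

    `history` is the (mutable) list of past spike entropies for the
    current context cluster. The threshold rule (mean of the history)
    and the warm-up rule (`min_history`) are assumptions for illustration.
    Returns (possibly biased logits, whether the bias was applied).
    """
    entropy = spike_entropy(softmax(logits))
    if len(history) >= min_history:
        threshold = sum(history) / len(history)
        apply_bias = entropy > threshold
    else:
        # Assumed behaviour: no watermark until the cluster has enough history.
        apply_bias = False
    history.append(entropy)

    if not apply_bias:
        return list(logits), False
    green = green_list(prev_token, len(logits), gamma)
    biased = [x + delta if i in green else x for i, x in enumerate(logits)]
    return biased, True
```

With a flat distribution over eight tokens the spike entropy is about 0.89, so the bias is applied against a history averaging 0.7. With one dominant logit the spike entropy falls close to 0.5 and the logits are returned unchanged, which is the behaviour that protects near-deterministic content such as code.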

Paper Structure

This paper contains 35 sections, 2 theorems, 16 equations, 11 figures, 3 tables, 2 algorithms.

Key Result

Theorem 1

Given a token sequence $y = \{y_0,\dots,y_{N-1}\}$ generated by a watermarked LLM, let $(S_0, \dots, S_{N-1})$ be the corresponding sequence of spike entropies. If a token $y_j$ satisfies the low-entropy condition, then excluding this token from the z-score calculation, as is done in $\textsc{CATMark}$, results in a higher lower bound on the z-score compared to including it, as in $\textsc{EWD}$.
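
The intuition behind excluding low-entropy tokens can be illustrated numerically. The sketch below is an assumption-laden toy, not the theorem's setting: it uses the standard green-token z-score $z = (g - \gamma T) / \sqrt{\gamma (1 - \gamma) T}$ with unweighted counting over all tokens as the baseline, whereas $\textsc{EWD}$ uses an entropy-weighted statistic. The function names and the per-token `thresholds` argument are hypothetical.

```python
import math


def z_score(green_flags, gamma=0.5):
    """Standard green-token z-score over the scored tokens."""
    t = len(green_flags)
    if t == 0:
        return 0.0
    g = sum(green_flags)
    return (g - gamma * t) / math.sqrt(gamma * (1.0 - gamma) * t)


def z_score_selective(green_flags, entropies, thresholds, gamma=0.5):
    """Score only tokens whose spike entropy exceeds their threshold."""
    kept = [g for g, s, th in zip(green_flags, entropies, thresholds) if s > th]
    return z_score(kept, gamma)


# Toy sequence: 8 high-entropy tokens that all landed in the green list,
# followed by 8 low-entropy tokens that are green only at chance level.
flags = [1] * 8 + [1, 0] * 4
entropies = [0.9] * 8 + [0.5] * 8
thresholds = [0.7] * 16

z_all = z_score(flags)                                    # all 16 tokens
z_sel = z_score_selective(flags, entropies, thresholds)   # 8 tokens kept
```

In this toy case the low-entropy tokens carry no watermark signal, so including them dilutes the statistic: `z_all` is 2.0, while `z_sel` is about 2.83.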

Figures (11)

  • Figure 1: Comparison between static-threshold watermarking and our context-aware, cluster-based thresholding method, $\textsc{CATMark}$. Our approach dynamically clusters generated tokens based on logit similarity (left panel), then computes a context-specific entropy threshold per cluster using historical entropy sequences (middle panel). Tokens whose entropy exceeds the adaptive threshold are watermarked (right panel). In the token sequence visualizations, rectangle height represents normalized entropy.
  • Figure 2: Hyperparameter sensitivity analysis for $\textsc{CATMark}$ with $\gamma = 0.5$ and $\delta = 2.0$ fixed. Subfigure \ref{fig:line} displays performance stability on HumanEval and MBPP across similarity thresholds $\alpha \in \{-2, -4, -6, -8, -10\}$ and minimum entropy sequence lengths $\rho \in \{1, 2, 3, 4, 5\}$. Subfigure \ref{fig:purity} illustrates the impact of $\alpha$ on the proportion of pure token categories with $\rho = 1$.
  • Figure 3: Watermark detection performance against two attacks. We set $\gamma = 0.5$ and $\delta = 2.0$ for all watermarking methods, and $\rho = 5$, $\alpha = -2$ for $\textsc{CATMark}$.
  • Figure 4: KGW-MBPP
  • Figure 5: SWEET-MBPP
  • ...and 6 more figures

Theorems & Definitions (3)

  • Theorem 1
  • Lemma F.1
  • proof