
SAEMark: Steering Personalized Multilingual LLM Watermarks with Sparse Autoencoders

Zhuohao Yu, Xingru Jiang, Weizheng Gu, Yidong Wang, Qingsong Wen, Shikun Zhang, Wei Ye

TL;DR

SAEMark tackles scalable, attribution-ready watermarking of AI-generated text in API-based and multilingual settings. It introduces a general framework that encodes a multi-bit message through inference-time, feature-based rejection sampling without altering model logits, using a Sparse Autoencoder to extract deterministic semantic features. The approach comes with theoretical guarantees that embedding success scales with the compute budget, and it is instantiated via the Feature Concentration Score, yielding high detection accuracy and text quality across diverse domains. Empirical results on four datasets show near-perfect per-unit accuracy and strong multi-bit performance while maintaining output quality and practical latency, enabling out-of-the-box attribution for closed-source LLMs. Overall, SAEMark establishes a scalable, language-agnostic paradigm for content attribution that works with API-based LLMs and arbitrary feature extractors.

Abstract

Watermarking LLM-generated text is critical for content attribution and misinformation prevention. However, existing methods compromise text quality and require white-box model access and logit manipulation. These limitations exclude API-based models and multilingual scenarios. We propose SAEMark, a general framework for post-hoc multi-bit watermarking that embeds personalized messages solely via inference-time, feature-based rejection sampling, without altering model logits or requiring training. Our approach operates on deterministic features extracted from generated text, selecting outputs whose feature statistics align with key-derived targets. This framework naturally generalizes across languages and domains while preserving text quality by sampling LLM outputs instead of modifying them. We provide theoretical guarantees relating watermark success probability and compute budget that hold for any suitable feature extractor. Empirically, we demonstrate the framework's effectiveness using Sparse Autoencoders (SAEs), achieving superior detection accuracy and text quality. Experiments across 4 datasets show SAEMark's consistent performance, with 99.7% F1 on English and strong multi-bit detection accuracy. SAEMark establishes a new paradigm for scalable watermarking that works out-of-the-box with closed-source LLMs while enabling content attribution.
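
The selection step described in the abstract (keep the sampled output whose feature statistic lands closest to a key-derived target) can be illustrated with a small sketch. This is not the authors' implementation. The names `key_derived_target`, `toy_feature_score`, `select_candidate`, `detect_message`, and `success_probability` are hypothetical, the hash-based score is only a deterministic stand-in for an SAE-based statistic such as the Feature Concentration Score, and the success formula assumes independent candidates with a fixed per-candidate hit probability, which may differ from the paper's actual bound.

```python
import hashlib
import hmac


def key_derived_target(key: bytes, message: int, position: int) -> float:
    """Map (key, message, position) to a pseudo-random target in [0, 1)."""
    payload = f"{message}:{position}".encode()
    digest = hmac.new(key, payload, hashlib.sha256).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def toy_feature_score(text: str) -> float:
    """Deterministic stand-in for a feature statistic (NOT an SAE score)."""
    digest = hashlib.sha256(text.encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def select_candidate(candidates, target, tolerance):
    """Rejection step: return the candidate closest to the target,
    or None if even the closest one misses by more than `tolerance`."""
    best = min(candidates, key=lambda c: abs(toy_feature_score(c) - target))
    if abs(toy_feature_score(best) - target) <= tolerance:
        return best
    return None


def detect_message(text, key, position, n_messages):
    """Recompute the score and return the message whose target is nearest."""
    score = toy_feature_score(text)
    return min(
        range(n_messages),
        key=lambda m: abs(score - key_derived_target(key, m, position)),
    )


def success_probability(p_single: float, n_candidates: int) -> float:
    """Chance that at least one of n independent candidates hits the target
    (assumption: independence and a fixed per-candidate probability)."""
    return 1.0 - (1.0 - p_single) ** n_candidates


if __name__ == "__main__":
    key = b"demo-key"  # hypothetical secret key
    # In practice the candidates would be sampled from an LLM API.
    candidates = [f"candidate sentence {i}" for i in range(50)]
    target = key_derived_target(key, message=5, position=0)
    chosen = select_candidate(candidates, target, tolerance=0.05)
    print("chosen:", chosen)
    print("success with 50 samples at p=0.1:", success_probability(0.1, 50))
```

Because the selection only reads finished outputs, a sketch like this needs no access to logits; raising the number of sampled candidates is what buys a higher chance of hitting the target, which is the compute-versus-success trade-off the abstract refers to.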

Paper Structure

This paper contains 65 sections, 11 equations, 14 figures, 3 tables, and 3 algorithms.

Figures (14)

  • Figure 1: An overview of SAEMark.
  • Figure 2: Watermark generation.
  • Figure 3: Distribution analysis of FCS. FCS distribution with density estimation (left) and Q-Q plot (right); statistical tests support approximate normality.
  • Figure 5: Adversarial robustness. ROC curves showing robust performance against three attack types with varying intensities.
  • Figure 6: Multi-bit scaling and information density. Watermark accuracy across different message bit lengths at fixed text length, demonstrating superior information density compared to multi-bit baselines, with $\ge$ 90% accuracy up to 10 bits.
  • ...and 9 more figures