LLM Fingerprinting via Semantically Conditioned Watermarks

Thibaud Gloaguen, Robin Staab, Nikola Jovanović, Martin Vechev

TL;DR

This paper tackles the problem of proving ownership of open-weight LLMs by moving from brittle, fixed query-key fingerprints to a robust, stealthy paradigm based on semantically conditioned watermarks. By selecting a high-entropy semantic domain (e.g., French) and diffusing a statistical watermark signal across each response, the method enables reliable fingerprint detection even after deployment changes such as finetuning, quantization, or pruning. The authors implement watermark distillation within the semantic domain and preserve non-domain behavior with a regularization term, achieving strong detection while maintaining model utility; detection scales with the number of concatenated responses, and extensive evaluations show robustness against 25 deployment scenarios and 5 targeted adversaries. The work provides a practical, provable approach to model provenance with broad implications for licensing, accountability, and reproducibility in LLM deployment, while acknowledging domain-design tradeoffs and potential misuse concerns.

Abstract

Most LLM fingerprinting methods teach the model to respond to a few fixed queries with predefined atypical responses (keys). This memorization often does not survive common deployment steps such as finetuning or quantization, and such keys can be easily detected and filtered from LLM responses, ultimately breaking the fingerprint. To overcome these limitations, we introduce LLM fingerprinting via semantically conditioned watermarks, replacing fixed query sets with a broad semantic domain, and replacing brittle atypical keys with a statistical watermarking signal diffused throughout each response. After teaching the model to watermark its responses only to prompts from a predetermined domain (e.g., the French language), the model owner can use queries from that domain to reliably detect the fingerprint and verify ownership. As we confirm in our thorough experimental evaluation, our fingerprint is both stealthy and robust to all common deployment scenarios.
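To make the detection step concrete, here is a minimal sketch of how an owner might turn several in-domain responses into a single p-value. It assumes a red-green style watermark with a hash-based green list, a context of `k` previous tokens, and a green fraction `GAMMA`; the excerpt above does not specify the scheme, so these choices, along with the names `SECRET_KEY`, `is_green`, `fingerprint_p_value`, and `toy_watermarked_tokens`, are illustrative assumptions and not the authors' implementation.

```python
import hashlib
import math
import random

GAMMA = 0.25                 # assumed green-list fraction (hypothetical value)
SECRET_KEY = "owner-secret"  # hypothetical owner key


def is_green(context, token, key=SECRET_KEY, gamma=GAMMA):
    """Pseudo-randomly mark (context, token) as green with probability gamma."""
    payload = f"{key}|{','.join(map(str, context))}|{token}".encode()
    digest = hashlib.sha256(payload).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def fingerprint_p_value(responses, k=1, key=SECRET_KEY, gamma=GAMMA):
    """One-sided p-value over all responses, counting each (context, token) once."""
    seen = set()
    green = 0
    total = 0
    for tokens in responses:
        for i in range(k, len(tokens)):
            item = (tuple(tokens[i - k:i]), tokens[i])
            if item in seen:
                continue  # deduplicate repeated (context, token) pairs
            seen.add(item)
            total += 1
            green += is_green(item[0], item[1], key, gamma)
    if total == 0:
        return 1.0
    z = (green - gamma * total) / math.sqrt(gamma * (1 - gamma) * total)
    return 0.5 * math.erfc(z / math.sqrt(2))  # P(N(0, 1) >= z)


def toy_watermarked_tokens(length, vocab=1000, k=1, seed=0, key=SECRET_KEY):
    """Toy stand-in for a fingerprinted model: only ever emits green tokens."""
    rng = random.Random(seed)
    tokens = [rng.randrange(vocab) for _ in range(k)]
    while len(tokens) < length:
        candidate = rng.randrange(vocab)
        if is_green(tokens[-k:], candidate, key):
            tokens.append(candidate)
    return tokens
```

In this sketch, the owner would collect replies to in-domain queries (e.g., French prompts), tokenize them, and flag the model when `fingerprint_p_value(replies)` falls below the decision threshold of $10^{-3}$ used in the figures. Pooling more responses increases the number of scored tokens, which is why detection improves with the number of queries.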

Paper Structure

This paper contains 103 sections, 1 theorem, 13 equations, 11 figures, 7 tables, 1 algorithm.

Key Result

Theorem D.1

Let $\omega \in \Sigma^*$ be a (deduplicated) token sequence sampled independently from $G$. For all $h \in \Sigma^k$, let $m_h := |\{i \le n : h_i = h\}|$. We introduce the effective length $n_{\mathrm{eff}}(\omega)$ […]. Assume that $\max_h m_h = o(\sqrt{n_{\mathrm{eff}}(\omega)})$ and set $Z(\omega)$ […]. Then $Z(\omega)$ asymptotically follows a standard normal distribution.
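The kind of statement made by the theorem can be sanity-checked empirically. The sketch below is a simplification and not the theorem's exact statistic: it assumes each deduplicated (context, token) pair is independently green with probability `GAMMA` (via a hash), uses the plain count of deduplicated pairs in place of the effective length, and checks that the resulting z-score is roughly standard normal on unwatermarked random text. All names and parameter values are hypothetical.

```python
import hashlib
import math
import random
import statistics
from collections import Counter

GAMMA = 0.25  # assumed green fraction (not taken from the paper)


def is_green(context, token, key="demo-key"):
    """Hash-based stand-in for a green-list lookup."""
    digest = hashlib.sha256(f"{key}|{context}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def z_score(tokens, k=1):
    """Normalized green count over deduplicated (context, token) pairs."""
    pairs = {(tuple(tokens[i - k:i]), tokens[i]) for i in range(k, len(tokens))}
    n = len(pairs)  # stand-in for the effective length
    green = sum(is_green(h, t) for h, t in pairs)
    return (green - GAMMA * n) / math.sqrt(GAMMA * (1 - GAMMA) * n)


def max_context_count(tokens, k=1):
    """max_h m_h: how often the most frequent length-k context occurs."""
    counts = Counter(tuple(tokens[i - k:i]) for i in range(k, len(tokens)))
    return max(counts.values())


def simulate(trials=400, length=300, vocab=50_000, seed=0):
    """Mean and standard deviation of z over random unwatermarked sequences."""
    rng = random.Random(seed)
    zs = []
    for _ in range(trials):
        tokens = [rng.randrange(vocab) for _ in range(length)]
        zs.append(z_score(tokens))
    return statistics.mean(zs), statistics.stdev(zs)
```

With a large vocabulary, contexts rarely repeat, so `max_context_count` stays small relative to the sequence length, which is the regime the theorem's assumption describes. In that regime `simulate()` should return a mean near 0 and a standard deviation near 1.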

Figures (11)

  • Figure 1: Illustration of Model Fingerprinting. Left: The model owner trains and releases a model in which they have previously embedded a fingerprint. A malicious deployer modifies the model and deploys it behind an API without honoring its restrictive license. Right: In prior work, fingerprint detection relies on specific query-key pairs, which are neither stealthy nor robust to most deployment scenarios. We propose to use semantic domains (e.g., French) and statistical signals (e.g., semantically conditioned LLM watermarks), making the fingerprint stealthy and consistently detectable.
  • Figure 2: Stealth Evaluation. FPR (Left) and Recall (Right) (i.e., the percentage of detected fingerprint queries/replies over all fingerprint queries/replies) of our GPT5-mini judge when detecting queries/replies of our fingerprint, IF, and SF. A lower recall indicates a stealthier fingerprint.
  • Figure 3: Detectability Against Adversarial Paraphrasing. We compare the detectability of our fingerprint (measured by the negative log p-value) with respect to the number of queries $|Q|$ after paraphrasing and adversarial paraphrasing [diaa2024optimizing]. We generate the replies with Llama3.1-8B and average all results over $5$ independent runs. In red, we show the fingerprint decision threshold of $10^{-3}$.
  • Figure 4: Robustness Against Finetuning With/Without Semantically Conditioned Watermarks. We compare the detectability of our fingerprint with semantically conditioned watermarking and with full watermarking (measured by the negative log p-value) with respect to the number of queries $|Q|$ after finetuning [diaa2024optimizing]. We generate the replies with Qwen2.5-3B finetuned on Alpaca, Dolly, or OpenMathInstruct, $5$ times independently, and average all results. In red, we show the fingerprint decision threshold of $10^{-3}$.
  • Figure 5: Ablation on the Number of Queries. We compare the detectability of our fingerprint (measured by the negative log p-value) with respect to the number of queries $|Q|$. We generate the replies with Llama3.1-8B and average all results over $5$ independent runs. For the finetuned model, we perform full finetuning on the French subset of WildChat; for the paraphrased setting, we paraphrase the output with GPT5-mini. In red, we show the fingerprint decision threshold of $10^{-3}$.
  • ...and 6 more figures

Theorems & Definitions (2)

  • Theorem D.1
  • proof