Data Provenance Auditing of Fine-Tuned Large Language Models with a Text-Preserving Technique

Yanming Li, Seifeddine Ghozzi, Cédric Eichler, Nicolas Anciaux, Alexandra Bensamoun, Lorena Gonzalez Manzano

TL;DR

This work tackles the challenge of verifying whether sensitive or copyrighted text was used to fine-tune large language models under black-box access. It proposes a text-preserving watermarking framework that embeds invisible Unicode syllables as cue–reply watermarks across documents, enabling post-hoc provenance auditing through black-box prompts. A ranking-based verification against counterfactual watermarks provides a provable bound on the false-positive rate, while a large watermark space supports multi-user attribution. Empirical results on open-weight LLMs show strong detection power, with per-document detection rates typically above 45% even when only a small fraction of data is watermarked, and a ranking test achieving 100% TPR at 0% FPR across thousands of challenges. The approach offers a scalable, minimally invasive tool for rights holders to verify data usage in model fine-tuning, while acknowledging limitations in robustness to aggressive transformations and reliance on trusted third-party watermark assignment, and outlining practical directions for future work and governance.

Abstract

We address the problem of auditing whether sensitive or copyrighted texts were used to fine-tune large language models (LLMs) under black-box access. Prior signals (verbatim regurgitation and membership inference) are unreliable at the level of individual documents or require altering the visible text. We introduce a text-preserving watermarking framework that embeds sequences of invisible Unicode characters into documents. Each watermark is split into a cue (embedded in odd chunks) and a reply (embedded in even chunks). At audit time, we submit prompts that contain only the cue; the presence of the corresponding reply in the model's output provides evidence of memorization consistent with training on the marked text. To obtain sound decisions, we compare the score of the published watermark against a held-out set of counterfactual watermarks and apply a ranking test with a provable false-positive-rate bound. The design is (i) minimally invasive (no visible text changes), (ii) scalable to many users and documents via a large watermark space and multi-watermark attribution, and (iii) robust to common passive transformations. We evaluate on open-weight LLMs and multiple text domains, analyzing regurgitation dynamics, sensitivity to training set size, and interference under multiple concurrent watermarks. Our results demonstrate reliable post-hoc provenance signals with bounded FPR under black-box access. We experimentally observe a failure rate of less than 0.1% when detecting a reply after fine-tuning with 50 marked documents. Conversely, no spurious reply was recovered in over 18,000 challenges, corresponding to 100% TPR at 0% FPR. Moreover, detection rates remain relatively stable as the dataset size increases, maintaining a per-document detection rate above 45% even when the marked collection accounts for less than 0.33% of the fine-tuning data.
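
The cue/reply split described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The alphabet of zero-width code points, the word-based chunking, the choice to append each mark at the end of its chunk, and the names `make_watermark`, `embed`, and `strip_invisible` are all assumptions made for illustration; the abstract does not specify any of them.

```python
import random

# ASSUMPTION: a small alphabet of zero-width code points stands in for the
# paper's "invisible Unicode characters"; the actual set is not given here.
INVISIBLE = ["\u200b", "\u200c", "\u200d", "\u2060"]


def make_watermark(rng, length=8):
    """Draw a hypothetical (cue, reply) pair of invisible-character strings."""
    cue = "".join(rng.choice(INVISIBLE) for _ in range(length))
    reply = "".join(rng.choice(INVISIBLE) for _ in range(length))
    return cue, reply


def embed(text, cue, reply, chunk_words=20):
    """Append the cue to odd chunks and the reply to even chunks (1-based).

    ASSUMPTION: chunks are fixed-size runs of space-separated words.
    """
    words = text.split(" ")
    chunks = [
        " ".join(words[i:i + chunk_words])
        for i in range(0, len(words), chunk_words)
    ]
    marked = []
    for idx, chunk in enumerate(chunks, start=1):
        mark = cue if idx % 2 == 1 else reply
        marked.append(chunk + mark)
    return " ".join(marked)


def strip_invisible(text):
    """Remove every character of the assumed invisible alphabet."""
    return "".join(ch for ch in text if ch not in INVISIBLE)


rng = random.Random(0)
cue, reply = make_watermark(rng)
text = " ".join(f"w{i}" for i in range(50))  # 50 words -> 3 chunks
marked = embed(text, cue, reply, chunk_words=20)
```

In this sketch the visible text is unchanged: `strip_invisible(marked)` returns `text` exactly, which is the sense in which the marking is text-preserving.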

Paper Structure

This paper contains 35 sections, 10 equations, 4 figures, 6 tables, 3 algorithms.

Figures (4)

  • Figure 1: Overview of the proposal. The main logical steps are indicated by black circles numbered 1 to 6. (1) A new sensitive dataset is produced. (2) A new batch of $K$ (e.g., 100) watermarks, each consisting of invisible Unicode characters, is selected. One of them is randomly chosen (here w$_1$) for publication, while the other $K-1$ are reserved for the final ranking to ensure statistically grounded decisions. (3) The dataset is watermarked $K$ times, once with each watermark. (4) The publishable dataset is the one marked with w$_1$. (5) A suspicious chatbot is based on a model fine-tuned on a dataset potentially containing sensitive documents (or parts of them). (6) This model is probed via black-box access: each odd (green) chunk of a marked document (with watermarks w$_1$ to w$_K$) is submitted, and the outputs are analyzed to detect the presence of the corresponding watermark’s "reply" sequence of syllables (represented by the blue dots and present in even chunks only). Finally, for each watermark and each document marked with it, the number of occurrences of the expected "reply" component of the watermark in the outputs is counted. The watermarks are then ranked by frequency. If the publication watermark ranks high enough (above a fixed threshold $k$), this indicates that sensitive data was included in the model’s fine-tuning set.
  • Figure 2: Number of regurgitated replies depending on collection size. Curves show average regurgitated replies; shaded areas indicate $\pm$ standard deviation.
  • Figure 3: Average regurgitated replies with $\pm$ std across varying dataset sizes with fixed collection size. Left: Blog1k ($\mathcal{D}$ size: Mistral=50, LLaMA=40). Right: Poems ($\mathcal{D}$ size: Mistral=60, LLaMA=40). Solid = Mistral, dashed = LLaMA.
  • Figure 4: Distribution of successful document-level challenges per watermark. Each figure corresponds to a specific model–dataset pair.
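
The final ranking step in the Figure 1 caption can be sketched in Python as follows. This is an illustration under stated assumptions, not the paper's algorithm: the helper names are hypothetical, ties are counted against the published watermark, and the bound $k/K$ is the standard one for a rank test when, absent training on the marked data, the published watermark is exchangeable with the $K-1$ counterfactual ones. The paper's exact statement of its bound may differ.

```python
def count_replies(outputs, reply):
    """Number of model outputs that contain the reply string."""
    return sum(reply in out for out in outputs)


def rank_of_published(scores, published):
    """1-based rank of the published watermark by score.

    ASSUMPTION: ties are counted against the published watermark,
    which keeps the test conservative.
    """
    target = scores[published]
    better_or_equal = sum(
        1 for name, s in scores.items() if name != published and s >= target
    )
    return 1 + better_or_equal


def ranking_test(scores, published, k):
    """Flag the model if the published watermark ranks within the top k.

    Returns (decision, rank, assumed FPR bound k/K).
    """
    K = len(scores)
    rank = rank_of_published(scores, published)
    return rank <= k, rank, k / K


# Hypothetical reply counts for K = 4 watermarks; w1 is the published one.
scores = {"w1": 37, "w2": 0, "w3": 1, "w4": 0}
decision, rank, bound = ranking_test(scores, "w1", k=1)
```

With the hypothetical counts above, the published watermark ranks first and the test flags the model, with an assumed bound of 1/4. When every score is zero, the conservative tie rule places the published watermark last, so the test does not fire.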