Data Provenance Auditing of Fine-Tuned Large Language Models with a Text-Preserving Technique
Yanming Li, Seifeddine Ghozzi, Cédric Eichler, Nicolas Anciaux, Alexandra Bensamoun, Lorena Gonzalez Manzano
TL;DR
This work tackles the challenge of verifying whether sensitive or copyrighted text was used to fine-tune large language models under black-box access. It proposes a text-preserving watermarking framework that embeds invisible Unicode syllables as cue–reply watermarks across documents, enabling post-hoc provenance auditing through black-box prompts. A ranking-based verification against counterfactual watermarks provides a provable bound on the false-positive rate, while a large watermark space supports multi-user attribution. Empirical results on open-weight LLMs show strong detection power, with per-document detection rates typically above 45% even when only a small fraction of data is watermarked, and a ranking test achieving 100%TPR at 0%FPR across thousands of challenges. The approach offers a scalable, minimally invasive tool for rights holders to verify data usage in model fine-tuning, while acknowledging limitations in robustness to aggressive transformations and reliance on trusted third-party watermark assignment, and outlining practical directions for future work and governance.
Abstract
We address the problem of auditing whether sensitive or copyrighted texts were used to fine-tune large language models (LLMs) under black-box access. Prior signals-verbatim regurgitation and membership inference-are unreliable at the level of individual documents or require altering the visible text. We introduce a text-preserving watermarking framework that embeds sequences of invisible Unicode characters into documents. Each watermark is split into a cue (embedded in odd chunks) and a reply (embedded in even chunks). At audit time, we submit prompts that contain only the cue; the presence of the corresponding reply in the model's output provides evidence of memorization consistent with training on the marked text. To obtain sound decisions, we compare the score of the published watermark against a held-out set of counterfactual watermarks and apply a ranking test with a provable false-positive-rate bound. The design is (i) minimally invasive (no visible text changes), (ii) scalable to many users and documents via a large watermark space and multi-watermark attribution, and (iii) robust to common passive transformations. We evaluate on open-weight LLMs and multiple text domains, analyzing regurgitation dynamics, sensitivity to training set size, and interference under multiple concurrent watermarks. Our results demonstrate reliable post-hoc provenance signals with bounded FPR under black-box access. We experimentally observe a failure rate of less than 0.1\% when detecting a reply after fine-tuning with 50 marked documents. Conversely, no spurious reply was recovered in over 18,000 challenges, corresponding to a 100\%TPR@0\% FPR. Moreover, detection rates remain relatively stable as the dataset size increases, maintaining a per-document detection rate above 45\% even when the marked collection accounts for less than 0.33\% of the fine-tuning data.
