Perturb Your Data: Paraphrase-Guided Training Data Watermarking

Pranav Shetty, Mirazul Haque, Petr Babkin, Zhiqiang Ma, Xiaomo Liu, Manuela Veloso

TL;DR

This work tackles the challenge of proving that a model's training data contains copyrighted or licensed material by introducing SPECTRA, a paraphrase-guided data watermarking method. It watermarks data before publication through paraphrase generation and score-based selection, enabling post-training verification without requiring access to the model’s decoding layers or a non-member dataset. The method uses Min-K%++ scores and a distribution-aware sampling strategy to preserve the characteristics of the original data while creating a detectable watermark. It is validated through a grey-box verification procedure and shows statistically significant p-value gaps across multiple datasets and model configurations. Empirically, SPECTRA outperforms existing baselines, achieves large p-value separations, and maintains paraphrase quality. The paper also states its limitations and identifies broader verification and legal applicability as avenues for future work.

Abstract

Training data detection is critical for enforcing copyright and data licensing, as Large Language Models (LLMs) are trained on massive text corpora scraped from the internet. We present SPECTRA, a watermarking approach that makes training data reliably detectable even when it comprises less than 0.001% of the training corpus. SPECTRA works by paraphrasing text using an LLM and assigning each paraphrase a score based on how likely it is according to a separate scoring model. A paraphrase is chosen so that its score closely matches that of the original text, to avoid introducing any distribution shift. To test whether a suspect model has been trained on the watermarked data, we compare its token probabilities against those of the scoring model. We demonstrate that SPECTRA achieves a consistent p-value gap of over nine orders of magnitude between data used for training and data not used for training, which is greater than that of all baselines tested. SPECTRA equips data owners with a scalable, deploy-before-release watermark that survives even large-scale LLM training.
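
The scoring and selection step described in the abstract can be illustrated with a minimal sketch. It assumes the standard Min-K%++ formulation (each token's log-probability is normalized by the mean and standard deviation of log-probabilities under the model's next-token distribution, and the lowest k% of normalized values are averaged). The function names, the default `k`, and the nearest-score selection rule are illustrative assumptions; the paper's distribution-aware sampling strategy may differ.

```python
import math


def min_k_pp_score(token_log_probs, vocab_log_probs, k=0.2):
    """Min-K%++-style score for one text (illustrative sketch).

    token_log_probs[t]: log-probability the scoring model assigned to the
        token actually observed at position t.
    vocab_log_probs[t]: log-probabilities over the full vocabulary at
        position t (same model, same context).
    k: fraction of lowest-scoring tokens to average (assumed default).
    """
    normalized = []
    for lp, dist in zip(token_log_probs, vocab_log_probs):
        probs = [math.exp(v) for v in dist]
        # Expected log-probability and its spread under the next-token distribution.
        mu = sum(p * v for p, v in zip(probs, dist))
        var = sum(p * (v - mu) ** 2 for p, v in zip(probs, dist))
        sigma = math.sqrt(var)
        normalized.append((lp - mu) / sigma if sigma > 0 else 0.0)
    n_keep = max(1, int(len(normalized) * k))
    lowest = sorted(normalized)[:n_keep]
    return sum(lowest) / n_keep


def pick_paraphrase(original_score, paraphrase_scores):
    """Index of the paraphrase whose score is closest to the original's.

    Simplified stand-in (assumption) for the paper's sampling strategy.
    """
    return min(
        range(len(paraphrase_scores)),
        key=lambda i: abs(paraphrase_scores[i] - original_score),
    )
```

In practice the log-probabilities would come from a forward pass of the scoring model over the original text and each candidate paraphrase; here they are plain lists so the sketch stays self-contained.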

Paper Structure

This paper contains 37 sections, 7 equations, 9 figures, 10 tables, 1 algorithm.

Figures (9)

  • Figure 1: Overview of our problem setting. I/P and O/P refer to input and output, respectively.
  • Figure 2: Overview of SPECTRA. Watermarking Phase: We use an LLM to generate multiple paraphrases of the original text. We sample one paraphrase that has a Min-K%++ score close to the original text. Verification Phase: Given a target LLM suspected of being trained on the watermarked data, we compute the Min-K%++ scores of the watermarked and original data and compare against the scores previously generated by the scoring model. Membership is detected through a paired t-test.
  • Figure 3: Heatmap showing fraction of evaluator scores for which the paraphrases received a rating $\geq 3$.
  • Figure 4: Trend of p-values with the number of samples.
  • Figure 5: Trend of p-values with the number of pre-training tokens. The dashed red line indicates the p-value threshold of $10^{-4}$.
  • ...and 4 more figures
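
The verification phase summarized in the Figure 2 caption ends in a paired t-test. The sketch below shows only the test statistic, under one plausible reading of the caption that is an assumption here: for each sample, the gap between the watermarked and original Min-K%++ scores under the target model is paired with the same gap under the scoring model. The function name and this pairing are hypothetical and not taken from the paper.

```python
import math
import statistics


def paired_t_statistic(target_gaps, scoring_gaps):
    """Paired t statistic and degrees of freedom (illustrative sketch).

    target_gaps[i]: watermarked-minus-original score for sample i under
        the suspect (target) model. Assumed pairing, not from the paper.
    scoring_gaps[i]: the same gap under the scoring model.
    """
    if len(target_gaps) != len(scoring_gaps):
        raise ValueError("both inputs must have one entry per sample")
    diffs = [t - s for t, s in zip(target_gaps, scoring_gaps)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired samples")
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("differences have zero variance")
    t_stat = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t_stat, n - 1
```

The p-value then follows from Student's t distribution with `n - 1` degrees of freedom; a statistics library routine such as SciPy's `scipy.stats.ttest_rel` computes both the statistic and the p-value in one call.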