Proving membership in LLM pretraining data via data watermarks

Johnny Tian-Zheng Wei; Ryan Yixiang Wang; Robin Jia

TL;DR

This work addresses how copyright holders can prove that their data was used to train LLMs. It introduces data watermarks that enable statistical detection under a hypothesis-testing framework with only black-box model access. It proposes two watermark families (random sequence insertions and Unicode lookalike substitutions) and analyzes how watermark length, number of duplications, and interference affect detection power. The study shows that watermarks weaken as the dataset grows but remain strong if the model size also increases. It validates the approach on BLOOM-176B by treating naturally occurring SHA hashes as post-hoc watermarks, which are robustly detected once they occur at least 90 times in the training data. The results support the feasibility of data watermarks for real-world rights enforcement.
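As a rough illustration of the two watermark families named above, the following Python sketch builds a random sequence watermark and a character-level Unicode lookalike watermark. The lookalike table, the choice to append the sequence at the end of a document, the 50% coin flip per character, and all function names are assumptions made for illustration; they are not taken from the paper's implementation.

```python
import random
import string

# Hypothetical lookalike table: Latin letters mapped to visually similar
# Cyrillic code points. A real deployment would use a larger table.
LOOKALIKES = {
    "a": "\u0430",
    "c": "\u0441",
    "e": "\u0435",
    "o": "\u043e",
    "p": "\u0440",
}


def sample_random_sequence(rng, length=10):
    """Sample a random character sequence to use as a watermark."""
    alphabet = string.ascii_letters + string.digits + string.punctuation
    return "".join(rng.choice(alphabet) for _ in range(length))


def add_random_sequence(doc, watermark):
    """Attach the same sequence to a document (placement is an assumption)."""
    return doc + "\n" + watermark


def sample_unicode_key(rng):
    """Flip one coin per character to decide whether it is substituted."""
    return {ch: sub for ch, sub in LOOKALIKES.items() if rng.random() < 0.5}


def apply_unicode_key(doc, key):
    """Replace every occurrence of each selected character with its lookalike."""
    return doc.translate(str.maketrans(key))


rng = random.Random(0)
docs = ["an example page", "a second page to protect"]

sequence = sample_random_sequence(rng)
sequence_marked = [add_random_sequence(d, sequence) for d in docs]

key = sample_unicode_key(rng)
unicode_marked = [apply_unicode_key(d, key) for d in docs]
```

In both cases the rightholder keeps the sampled secret (the sequence or the substitution key) so that it can later be compared against other draws from the same distribution.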

Abstract

Detecting whether copyright holders' works were used in LLM pretraining is poised to be an important problem. This work proposes using data watermarks to enable principled detection with only black-box model access, provided that the rightholder contributed multiple training documents and watermarked them before public release. By applying a randomly sampled data watermark, detection can be framed as hypothesis testing, which provides guarantees on the false detection rate. We study two watermarks: one that inserts random sequences, and another that randomly substitutes characters with Unicode lookalikes. We first show how three aspects of watermark design -- watermark length, number of duplications, and interference -- affect the power of the hypothesis test. Next, we study how a watermark's detection strength changes under model and dataset scaling: while increasing the dataset size decreases the strength of the watermark, watermarks remain strong if the model size also increases. Finally, we view SHA hashes as natural watermarks and show that we can robustly detect hashes from BLOOM-176B's training data, as long as they occurred at least 90 times. Together, our results point towards a promising future for data watermarks in real-world use.
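The hypothesis test described in the abstract can be sketched in Python as follows: the loss on the rightholder's watermark is compared against the losses on fresh sequences drawn from the same distribution, and the result is summarized as a $Z$-score. This is a minimal sketch under stated assumptions. The character set, the number of null samples, and the `make_toy_loss` stand-in (which imitates a model that has memorized one sequence) are hypothetical; in practice `loss_fn` would query the model under test for its average token loss.

```python
import random
import statistics
import string

# Assumed character set for random sequences (not specified in the abstract).
ALPHABET = string.ascii_letters + string.digits + string.punctuation


def sample_sequence(rng, length):
    """Draw one random sequence from the watermark distribution."""
    return "".join(rng.choice(ALPHABET) for _ in range(length))


def z_score(test_loss, null_losses):
    """Standardize the watermark's loss against the null distribution."""
    mu = statistics.mean(null_losses)
    sigma = statistics.stdev(null_losses)
    return (test_loss - mu) / sigma


def watermark_test(loss_fn, watermark, rng, n_null=1000):
    """Compare the watermark's loss with n_null fresh draws of equal length."""
    null_losses = [
        loss_fn(sample_sequence(rng, len(watermark))) for _ in range(n_null)
    ]
    return z_score(loss_fn(watermark), null_losses)


def make_toy_loss(memorized):
    """Hypothetical stand-in for a black-box model's average token loss."""
    def loss_fn(seq):
        # Deterministic per-sequence loss centered at 5.0.
        base = random.Random(seq).gauss(5.0, 0.2)
        # The memorized sequence gets a markedly lower loss.
        return base - 2.0 if seq == memorized else base
    return loss_fn


rng = random.Random(0)
watermark = sample_sequence(rng, 10)
z = watermark_test(make_toy_loss(watermark), watermark, rng)
```

A strongly negative `z` means the model assigns the watermark a much lower loss than comparable random sequences, which is evidence that the watermarked documents were in the training data.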

Paper Structure (52 sections, 9 figures, 1 table)

Introduction
Related work
Dataset membership.
Membership inference.
Memorization.
Data watermarks
Testing for data watermarks
$Z$-scores.
Random sequence watermark
Perturbation.
Scoring function.
Unicode watermark
Global perturbation.
Word-level perturbation.
Scoring function.
...and 37 more sections

Figures (9)

Figure 1: An illustration of hypothesis testing for membership inference. Before public release, the rightholder inserts "$\texttt{MPadd*t6Ex}$", a watermark sampled from a distribution of random sequences, across their document collection. The model's average token loss on all the random sequences forms a null distribution, and the loss on the included watermark is the test statistic. The effectiveness of the hypothesis test is determined by the effect size and the variance of the null distribution.
Figure 2: Experiments on random sequence watermarks relating watermark length and the number of watermarked documents to detection strength. Results in (a) are averaged over 5 runs, and (b) and (c) visualize the null distribution and test statistic for one run. More negative $Z$-scores indicate stronger watermarks. (a) Watermark strength increases as the number of watermarked documents increases, but tapers off quickly. Watermark length determines the eventual strength. (b) Fixing a watermark length, as the number of watermarked documents increases, the watermark loss decreases. (c) Fixing the number of watermarked documents, as the watermark length increases, the null distribution's variance decreases.
Figure 3: Experiments on Unicode variants and interference. Results in (a) and (b) are averaged over 5 runs, and (c) visualizes the null distribution and test statistic for one run. (a) Word-level Unicode watermarks outperform the global variant. (b) Inserting multiple independent Unicode watermarks (256 docs per experiment) causes their strengths to degrade, but random sequences are not affected by interference. (c) For the word-level Unicode watermark, as more independent watermarks are inserted, the null distribution shifts down, causing the strength to drop.
Figure 4: Experiments on random sequence watermarks under model and dataset scaling. All experiments watermark 256 documents with a length-80 random sequence. Results in (a) are averaged over 3 runs, and (b) and (c) visualize the null distribution and test statistic for one run. (a) When scaling the training data, watermarks become weaker. However, watermarks remain strong for larger models. (b) As dataset size scales, the watermark loss of the 70M model increases. (c) As dataset size scales, the watermark loss of the 410M model remains roughly constant.
Figure 5: Test results for BLOOM-176B on SHA and MD5 hashes naturally occurring in StackExchange. Occurrences are collected from the ROOTS search tool, and multiple occurrences may appear in the same document. A SHA-512 hash occurring 12 times can achieve $10$-sigma detection. The dotted lines denote a threshold of $Z=-2$ and a false detection rate of $\alpha<5\%$. Empirically, robust detection is possible past 90 occurrences.
...and 4 more figures
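The Figure 5 caption pairs a threshold of $Z=-2$ with a false detection rate of $\alpha<5\%$. A small Python sketch shows how such a pairing can be computed, assuming (as an approximation made here, not a claim from the captions) that the null distribution is close to normal, so that the false detection rate is the one-sided lower tail of a standard normal. The function name is hypothetical.

```python
import math


def false_detection_rate(z):
    """Lower-tail probability P(Z <= z) for a standard normal variable."""
    return 0.5 * math.erfc(-z / math.sqrt(2))


alpha_at_threshold = false_detection_rate(-2.0)   # about 0.023, below 0.05
alpha_at_ten_sigma = false_detection_rate(-10.0)  # vanishingly small
```

Under this approximation, the $Z=-2$ threshold sits comfortably within the 5% bound, and a $10$-sigma detection corresponds to a negligible chance of a false positive.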
