
Leave No TRACE: Black-box Detection of Copyrighted Dataset Usage in Large Language Models via Watermarking

Jingqi Zhang, Ruibo Chen, Yingqing Yang, Peihua Mai, Heng Huang, Yan Pang

TL;DR

TRACE provides a practical, fully black-box method to verify copyrighted dataset usage in LLM fine-tuning by watermarking datasets with distortion-free rewrites guided by a private key and detecting the watermark via an entropy-gated analysis of model outputs. The approach yields statistically significant evidence across diverse datasets and model families, enabling multi-dataset attribution and demonstrating robustness to continued pretraining while preserving text quality and downstream performance. The method relies on a two-stage process: (i) watermarked dataset rewriting and (ii) black-box detection that concentrates on high-uncertainty token positions to amplify signal. These results offer a scalable, private-key-based mechanism for rights holders to verify data usage in commercial LLMs and establish TRACE as a practical route for dataset copyright protection in real-world deployments.

Abstract

Large Language Models (LLMs) are increasingly fine-tuned on smaller, domain-specific datasets to improve downstream performance. These datasets often contain proprietary or copyrighted material, raising the need for reliable safeguards against unauthorized use. Existing membership inference attacks (MIAs) and dataset-inference methods typically require access to internal signals such as logits, while current black-box approaches often rely on handcrafted prompts or a clean reference dataset for calibration, both of which limit practical applicability. Watermarking is a promising alternative, but prior techniques can degrade text quality or reduce task performance. We propose TRACE, a practical framework for fully black-box detection of copyrighted dataset usage in LLM fine-tuning. TRACE rewrites datasets with distortion-free watermarks guided by a private key, ensuring both text quality and downstream utility. At detection time, we exploit the radioactivity effect of fine-tuning on watermarked data and introduce an entropy-gated procedure that selectively scores high-uncertainty tokens, substantially amplifying detection power. Across diverse datasets and model families, TRACE consistently achieves significant detections ($p<0.05$), often with extremely strong statistical evidence. Furthermore, it supports multi-dataset attribution and remains robust even after continued pretraining on large non-watermarked corpora. These results establish TRACE as a practical route to reliable black-box verification of copyrighted dataset usage. We will make our code available at: https://github.com/NusIoraPrivacy/TRACE.
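
To make the entropy-gated detection step more concrete, the following is a minimal sketch of the general idea, not the authors' implementation. Several pieces are assumptions made only for illustration: the per-token entropies are taken as given (in a black-box setting they would have to be estimated, for example with a separate proxy model), the keyed score uses a SHA-256 hash over a two-token context, the per-token statistic is an exponential-style score, and the p-value comes from a normal approximation. The names `keyed_uniform` and `entropy_gated_pvalue` are hypothetical. Only the "top 70% by entropy" default is taken from the Figure 2 caption.

```python
import hashlib
import math


def keyed_uniform(key: bytes, context: tuple, token: int) -> float:
    """Hypothetical keyed score: a pseudo-random value in (0, 1) derived from
    the private key, the preceding tokens, and the candidate token."""
    payload = key + b"|" + ",".join(map(str, context)).encode() + b"|" + str(token).encode()
    digest = hashlib.sha256(payload).digest()
    # Keep 52 bits so that (x + 0.5) / 2**52 is exact and strictly inside (0, 1).
    x = int.from_bytes(digest[:8], "big") >> 12
    return (x + 0.5) / 2**52


def entropy_gated_pvalue(tokens, entropies, key, top_fraction=0.7, window=2):
    """Score only the highest-entropy positions and return (z, p).

    Under the null hypothesis (no watermark), each keyed value is uniform, so
    -log(1 - u) is Exp(1) with mean 1 and variance 1. The summed score is then
    compared with its null distribution via a normal approximation.
    """
    positions = list(range(window, len(tokens)))
    if not positions:
        raise ValueError("need more tokens than the context window")
    # Entropy gate: keep the top fraction of positions by (estimated) entropy.
    positions.sort(key=lambda i: entropies[i], reverse=True)
    kept = positions[: max(1, int(len(positions) * top_fraction))]
    scores = [
        -math.log(1.0 - keyed_uniform(key, tuple(tokens[i - window:i]), tokens[i]))
        for i in kept
    ]
    n = len(scores)
    z = (sum(scores) - n) / math.sqrt(n)
    p = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided upper-tail p-value
    return z, p
```

The intuition behind the gate is that low-entropy positions are nearly deterministic, so a watermark has little room to influence them and scoring them mostly adds noise. Restricting the test to high-uncertainty positions concentrates the statistic where a key-dependent bias can actually appear.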


Paper Structure

This paper contains 26 sections, 9 equations, 3 figures, 6 tables, 1 algorithm.

Figures (3)

  • Figure 1: Overview of TRACE. The framework has two stages. (Left) The dataset owner generates a watermarked rewrite $D'=\mathcal{W}(D,k)$ of the original dataset using a watermarked rewrite model with a private key $k$, and releases $D'$ publicly. (Right) To verify dataset usage, the owner queries a suspect model $M$ with prompts and collects outputs. High-entropy tokens are selected against the private key to compute the watermark scores. A statistical test is then conducted to decide whether the model exhibits watermark radioactivity, indicating that it was fine-tuned on $D'$.
  • Figure 2: Entropy ablation experiments across four datasets using 40k scored tokens with the top 70% by entropy on LLaMA-3B.
  • Figure 3: Effect of watermarked-sample proportion and number of scored tokens on detection strength for TRACE. Curves plot $-\log_{10}(p)$ (higher = more significant) as a function of the number of scored tokens. (a) The proportion of watermarked samples: $\rho=50\%$. (b) $\rho=100\%$.
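
As a rough aid for reading the $-\log_{10}(p)$ axis in Figure 3, the sketch below shows how significance would grow with the number of scored tokens under a simple one-sided z-test. This is an illustrative assumption, not the paper's test statistic: `mean_excess` is a hypothetical per-token standardized shift in the watermark score, and a smaller watermarked proportion $\rho$ would plausibly correspond to a smaller value of it.

```python
import math


def neg_log10_p(n_tokens: int, mean_excess: float) -> float:
    """-log10 of a one-sided normal p-value when each scored token adds
    `mean_excess` standard deviations of signal on average (assumed model)."""
    z = mean_excess * math.sqrt(n_tokens)
    p = 0.5 * math.erfc(z / math.sqrt(2))
    if p == 0.0:
        # The p-value underflowed double precision; report it as infinite evidence.
        return math.inf
    return -math.log10(p)


# With a fixed per-token shift, the evidence grows as more tokens are scored.
for n in (1_000, 10_000, 40_000):
    print(n, round(neg_log10_p(n, 0.03), 2))
```

Because $z$ scales with $\sqrt{n}$ and the normal tail decays like $e^{-z^2/2}$, $-\log_{10}(p)$ grows roughly linearly in the number of scored tokens under this model, which matches the upward trend described in the caption.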