Digger: Detecting Copyright Content Mis-usage in Large Language Model Training
Haodong Li, Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, Yang Liu, Guoai Xu, Guosheng Xu, Haoyu Wang
TL;DR
Digger detects whether copyrighted content was used in LLM training through a loss-gap framework: it builds a Baseline LLM, a Reference LLM, and a Vanilla/Target setup to quantify how target content influences model behavior. It calibrates the resulting loss distributions with Wasserstein distance and ROC-AUC to derive a data-driven loss threshold for membership inference, and is validated on GPT-2-XL, LLaMA-7b, and real-world quote data. The study shows that content exposure leaves measurable imprints in loss dynamics, with larger models and longer test sequences yielding clearer signals, and provides evidence that such detection is feasible even without explicit ground-truth labels. These contributions support more transparent data governance in LLM development, as well as open-source tooling for auditing training data for copyright concerns.
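To make the calibration idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes each sample has already been reduced to a single score (for example a loss or loss gap) and that seen content tends to score lower than unseen content; both are assumptions for illustration. The function names and the toy numbers are hypothetical. The sketch measures how far apart the two score distributions are (1-D Wasserstein distance), how separable they are (ROC-AUC), and picks a threshold by maximizing TPR minus FPR, which is one common choice among several.

```python
from statistics import mean


def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance between two equal-sized samples.

    For equal sample sizes this is the mean absolute difference
    between the sorted values of the two samples.
    """
    if len(a) != len(b):
        raise ValueError("this sketch assumes equal sample sizes")
    return mean(abs(x - y) for x, y in zip(sorted(a), sorted(b)))


def roc_auc(member_scores, nonmember_scores):
    """Probability that a random member scores lower than a random
    non-member (ties count as half). Assumes lower score = more likely seen.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m < n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


def best_threshold(member_scores, nonmember_scores):
    """Return (threshold, J) maximizing J = TPR - FPR, where a sample is
    predicted 'member' when its score is <= threshold.
    """
    best_t, best_j = None, -1.0
    for t in sorted(set(member_scores) | set(nonmember_scores)):
        tpr = sum(s <= t for s in member_scores) / len(member_scores)
        fpr = sum(s <= t for s in nonmember_scores) / len(nonmember_scores)
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t, best_j


# Toy, made-up scores purely for illustration (not from the paper).
seen = [1.0, 1.2, 1.5, 2.0]
unseen = [1.8, 2.5, 2.7, 3.0]

distance = wasserstein_1d(seen, unseen)
auc = roc_auc(seen, unseen)
threshold, youden_j = best_threshold(seen, unseen)
print(distance, auc, threshold, youden_j)
```

In practice the threshold would be calibrated on samples whose membership is known by construction and then applied to the target content; how the scores themselves are computed is left to the framework described in the paper.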
Abstract
Pre-training, which utilizes extensive and varied datasets, is a critical factor in the success of Large Language Models (LLMs) across numerous applications. However, the detailed makeup of these datasets is often not disclosed, leading to concerns about data security and potential misuse. This is particularly relevant when copyrighted material, still under legal protection, is used inappropriately, either intentionally or unintentionally, infringing on the rights of the authors. In this paper, we introduce a detailed framework designed to detect and assess the presence of content from potentially copyrighted books within the training datasets of LLMs. This framework also provides a confidence estimation for the likelihood of each content sample's inclusion. To validate our approach, we conduct a series of simulated experiments, the results of which affirm the framework's effectiveness in identifying and addressing instances of content misuse in LLM training processes. Furthermore, we investigate the presence of recognizable quotes from famous literary works within these datasets. The outcomes of our study have significant implications for ensuring the ethical use of copyrighted materials in the development of LLMs, highlighting the need for more transparent and responsible data management practices in this field.
