Digger: Detecting Copyright Content Mis-usage in Large Language Model Training

Haodong Li, Gelei Deng, Yi Liu, Kailong Wang, Yuekang Li, Tianwei Zhang, Yang Liu, Guoai Xu, Guosheng Xu, Haoyu Wang

TL;DR

Digger addresses the challenge of detecting copyright content usage in LLM training by leveraging a loss-gap framework that builds a Baseline LLM, a Reference LLM, and a Vanilla/Target setup to quantify how target content influences model behavior. It employs distributional calibration via Wasserstein distance and ROC-AUC to derive a data-driven loss threshold for membership inference, validated on GPT-2-XL, LLaMA-7b, and real-world quote data. The study demonstrates that content exposure leaves measurable imprints in loss dynamics, with larger models and longer test sequences yielding clearer signals, and provides evidence that such detection is feasible even without explicit ground-truth labels. These contributions support more transparent data governance in LLM development and open-source tooling to audit training data for copyright considerations.
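
The calibration step mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the score values are synthetic, the assumption that seen samples receive higher scores than unseen ones is made purely for illustration, and the helper names (`wasserstein_1d`, `roc_auc`, `best_threshold`) are hypothetical. The sketch measures how far apart two score distributions are (1-D Wasserstein distance), how separable they are (ROC-AUC), and picks a threshold by maximizing TPR minus FPR, which is one common choice and not necessarily the paper's.

```python
import random
from statistics import mean


def wasserstein_1d(a, b):
    """1-D Wasserstein-1 distance between two equally sized samples."""
    if len(a) != len(b):
        raise ValueError("this sketch assumes equally sized samples")
    # For equal-size 1-D samples, the optimal transport plan pairs sorted values.
    return mean(abs(x - y) for x, y in zip(sorted(a), sorted(b)))


def roc_auc(seen_scores, unseen_scores):
    """Probability that a random seen score exceeds a random unseen score (ties count half)."""
    wins = 0.0
    for s in seen_scores:
        for u in unseen_scores:
            if s > u:
                wins += 1.0
            elif s == u:
                wins += 0.5
    return wins / (len(seen_scores) * len(unseen_scores))


def best_threshold(seen_scores, unseen_scores):
    """Threshold t maximizing TPR - FPR, where score >= t predicts 'seen'."""
    best_t, best_j = None, -1.0
    for t in sorted(set(seen_scores) | set(unseen_scores)):
        tpr = sum(s >= t for s in seen_scores) / len(seen_scores)
        fpr = sum(u >= t for u in unseen_scores) / len(unseen_scores)
        if tpr - fpr > best_j:
            best_t, best_j = t, tpr - fpr
    return best_t


# Synthetic stand-ins for per-sample loss-gap scores (NOT data from the paper).
rng = random.Random(0)
seen = [rng.gauss(1.0, 0.3) for _ in range(200)]    # assumed: samples in training data
unseen = [rng.gauss(0.0, 0.3) for _ in range(200)]  # assumed: samples not in training data

distance = wasserstein_1d(seen, unseen)
auc = roc_auc(seen, unseen)
threshold = best_threshold(seen, unseen)

print(f"Wasserstein distance: {distance:.3f}")
print(f"ROC-AUC: {auc:.3f}")
print(f"Chosen threshold: {threshold:.3f}")
```

In practice, the two score lists would come from model losses on samples whose membership is known from a controlled simulation, and the resulting threshold would then be applied to samples whose membership is unknown.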

Abstract

Pre-training, which utilizes extensive and varied datasets, is a critical factor in the success of Large Language Models (LLMs) across numerous applications. However, the detailed makeup of these datasets is often not disclosed, leading to concerns about data security and potential misuse. This is particularly relevant when copyrighted material, still under legal protection, is used inappropriately, either intentionally or unintentionally, infringing on the rights of the authors. In this paper, we introduce a detailed framework designed to detect and assess the presence of content from potentially copyrighted books within the training datasets of LLMs. This framework also provides a confidence estimation for the likelihood of each content sample's inclusion. To validate our approach, we conduct a series of simulated experiments, the results of which affirm the framework's effectiveness in identifying and addressing instances of content misuse in LLM training processes. Furthermore, we investigate the presence of recognizable quotes from famous literary works within these datasets. The outcomes of our study have significant implications for ensuring the ethical use of copyrighted materials in the development of LLMs, highlighting the need for more transparent and responsible data management practices in this field.

Paper Structure

This paper contains 23 sections, 3 equations, 5 figures, and 9 tables.

Figures (5)

  • Figure 1: Different versions of GPT-2 are trained on the samples for different numbers of learning iterations. The figure shows the resulting changes in sample loss on GPT-2 and the magnitude of the loss variation.
  • Figure 2: Overview of our preliminary study, which consists of three phases: 1) Preparation phase, 2) Training phase, and 3) Inference phase. LLMs marked with different colors were fine-tuned on different datasets.
  • Figure 3: Overview of our methodology, which consists of three phases: 1) Preparation phase, 2) Simulation Experiment phase, and 3) Confidence Calculation phase. LLMs marked with different colors were fine-tuned on different datasets.
  • Figure 4: Distribution curves for the various stages. We designed three comparative experiments based on the differences in samples within the target dataset.
  • Figure 5: Distribution curves in a real-world setting for GPT-2 XL (top) and LLaMA 7b (bottom).