Inner-Probe: Discovering Copyright-related Data Generation in LLM Architecture
Qichao Ma, Rui-Jie Zhu, Peiye Liu, Renye Yan, Fahong Zhang, Ling Liang, Meng Li, Zhaofei Yu, Zongwei Wang, Yimao Cai, Tiejun Huang
TL;DR
This paper addresses the challenge of identifying how copyrighted data in training sets influence LLM outputs. It introduces Inner-Probe, a lightweight framework that leverages multi-head attention signals and an LSTM-based extractor to attribute sub-dataset contributions and to filter non-copyright content via a contrastive learning module. The method achieves high attribution accuracy (often >95%) and strong non-copyright filtering performance (AUC up to 0.954) across multiple models and datasets, while remaining substantially more efficient than prior text- or prompt-based approaches. The work demonstrates practical applicability through real-world case studies (Books3) and extensive experiments, and outlines extensions to larger, multilingual datasets and multimodal models for broader copyright protection in deployment scenarios.
Abstract
Large Language Models (LLMs) utilize extensive knowledge databases and show powerful text generation ability. However, their reliance on high-quality copyrighted datasets raises concerns about copyright infringement in generated texts. Current research often employs prompt engineering or semantic classifiers to identify copyrighted content, but these approaches have two significant limitations: (1) they struggle to identify which specific sub-dataset (e.g., works from particular authors) influences an LLM's output, and (2) they treat the entire training database as copyrighted, overlooking the non-copyrighted training data it contains. We propose Inner-Probe, a lightweight framework designed to evaluate the influence of copyrighted sub-datasets on LLM-generated texts. Unlike traditional methods that rely solely on text, we find that the results of multi-head attention (MHA) during LLM output generation provide more effective information. Thus, Inner-Probe performs sub-dataset contribution analysis using a lightweight LSTM-based network trained on MHA results in a supervised manner. Harnessing such a prior, Inner-Probe enables non-copyrighted text detection through a concatenated global projector trained with unsupervised contrastive learning. Inner-Probe demonstrates 3x improved efficiency compared to semantic model training in sub-dataset contribution analysis on Books3, achieves 15.04% to 58.7% higher accuracy over baselines on the Pile, and delivers a 0.104 increase in AUC for non-copyrighted data filtering.
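To make the sub-dataset contribution analysis more concrete, the following is a minimal, dependency-free sketch of an LSTM reading a sequence of MHA feature vectors and emitting a distribution over sub-datasets. It is not the authors' implementation. The class name `TinyLSTMAttributor`, the choice of one feature vector per transformer layer as the sequence axis, the softmax head, and all dimensions are assumptions made for illustration. The weights are random and untrained, and the input features are synthetic stand-ins for real MHA outputs.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class TinyLSTMAttributor:
    """Hypothetical stand-in for an LSTM-based extractor over MHA features."""

    GATES = ("input", "forget", "candidate", "output")

    def __init__(self, feat_dim, hidden_dim, num_subdatasets, seed=0):
        rng = random.Random(seed)
        self.hidden_dim = hidden_dim
        # One weight matrix per gate, applied to the concatenation [x; h].
        self.weights = {
            name: rand_matrix(hidden_dim, feat_dim + hidden_dim, rng)
            for name in self.GATES
        }
        # Linear head mapping the final hidden state to sub-dataset logits.
        self.head = rand_matrix(num_subdatasets, hidden_dim, rng)

    def step(self, x, h, c):
        """Standard LSTM cell update (biases omitted for brevity)."""
        z = x + h  # list concatenation: [x; h]
        pre = {name: matvec(self.weights[name], z) for name in self.GATES}
        i = [sigmoid(v) for v in pre["input"]]
        f = [sigmoid(v) for v in pre["forget"]]
        g = [math.tanh(v) for v in pre["candidate"]]
        o = [sigmoid(v) for v in pre["output"]]
        c_new = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
        h_new = [oj * math.tanh(cj) for oj, cj in zip(o, c_new)]
        return h_new, c_new

    def contributions(self, feature_sequence):
        """Return a softmax distribution over sub-datasets."""
        h = [0.0] * self.hidden_dim
        c = [0.0] * self.hidden_dim
        for x in feature_sequence:  # assumed: one step per transformer layer
            h, c = self.step(x, h, c)
        logits = matvec(self.head, h)
        peak = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in logits]
        total = sum(exps)
        return [e / total for e in exps]


# Synthetic stand-in for MHA outputs: 12 layers, 8 features per layer.
rng = random.Random(1)
features = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(12)]

model = TinyLSTMAttributor(feat_dim=8, hidden_dim=16, num_subdatasets=4)
probs = model.contributions(features)
print([round(p, 3) for p in probs])
```

In a supervised setting such as the one the abstract describes, the weights would be fitted with a cross-entropy loss against known sub-dataset labels, most likely in a deep-learning framework rather than plain Python. The training loop and the contrastive global projector are omitted here.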
