Copyright Detection in Large Language Models: An Ethical Approach to Generative AI Development
David Szczecina, Senan Gaffori, Edmond Li
TL;DR
Copyright concerns in LLM training data motivate transparent verification of content usage. The paper introduces an open-source platform that extends DE-COP-like detection with a scalable, user-friendly pipeline including passage extraction, paraphrase generation, QA, and multiple-choice evaluation, augmented by SBERT preprocessing and a Pinecone-backed vector store. It reduces computational costs by 10-30% and improves robustness against paraphrasing while enabling end-to-end verification by content creators. The work advances ethical AI development by providing a practical tool for accountability and paving the way for policy-relevant copyright enforcement.
Abstract
The widespread use of Large Language Models (LLMs) raises critical concerns regarding the unauthorized inclusion of copyrighted content in training data. Existing detection frameworks, such as DE-COP, are computationally intensive and largely inaccessible to independent creators. As legal scrutiny increases, there is a pressing need for a scalable, transparent, and user-friendly solution. This paper introduces an open-source copyright detection platform that enables content creators to verify whether their work was used in LLM training datasets. Our approach enhances existing methodologies by improving ease of use, improving similarity detection, optimizing dataset validation, and reducing computational overhead by 10-30% through efficient API calls. With an intuitive user interface and a scalable backend, this framework increases transparency in AI development and supports ethical compliance, laying the foundation for further research in responsible AI development and copyright enforcement.
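To make the multiple-choice evaluation mentioned above concrete, the following is a minimal sketch assuming a DE-COP-style setup: the model is shown a verbatim passage shuffled among paraphrases and asked to identify the original, and a pick rate well above chance is read as a signal of possible memorization. The function names and the `choose` callable (a stand-in for an LLM query) are hypothetical and are not taken from the platform's code.

```python
import random
from typing import Callable, List, Sequence, Tuple

# Hypothetical stand-in for an LLM call: takes the answer options and
# returns the index of the option the model believes is the original.
Chooser = Callable[[List[str]], int]


def build_question(original: str, paraphrases: Sequence[str],
                   rng: random.Random) -> Tuple[List[str], int]:
    """Shuffle the verbatim passage among its paraphrases.

    Returns the shuffled options and the index of the verbatim passage.
    """
    options = [original, *paraphrases]
    rng.shuffle(options)
    return options, options.index(original)


def verbatim_pick_rate(items: Sequence[Tuple[str, Sequence[str]]],
                       choose: Chooser, seed: int = 0) -> float:
    """Fraction of questions in which `choose` selects the verbatim passage."""
    if not items:
        raise ValueError("items must not be empty")
    rng = random.Random(seed)  # fixed seed keeps option order reproducible
    hits = 0
    for original, paraphrases in items:
        options, answer = build_question(original, paraphrases, rng)
        if choose(options) == answer:
            hits += 1
    return hits / len(items)


def chance_level(n_paraphrases: int) -> float:
    """Expected pick rate for a model that guesses uniformly at random."""
    return 1.0 / (1 + n_paraphrases)
```

With three paraphrases per passage, `chance_level(3)` is 0.25, so a model that has not seen the text would be expected to score near that value, while a substantially higher rate would warrant closer inspection. Shuffling the options guards against position bias in the model's answers.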
