Generative AI Training and Copyright Law

Tim W. Dornis, Sebastian Stober

TL;DR

The paper analyzes how training generative AI (GenAI) models on copyrighted data intersects with copyright law, arguing that neither the Text and Data Mining (TDM) exception nor fair use neatly covers GenAI training. It distinguishes TDM from GenAI training and treats memorization as a separate legal concern: training data can be memorized, and both sharing models and generating memorized content raise infringement risks. It proposes a pragmatic approach to data provenance, informed by music information retrieval (MIR), in the form of a tiered documentation framework that uses audio fingerprinting and metadata to improve attribution and transparency. The work highlights the need for explicit permissions or new harmonized legal frameworks and positions the MIR community as a leader in developing fair attribution and responsible data practices for GenAI development.

Abstract

Training generative AI models requires extensive amounts of data. A common practice is to collect such data through web scraping. Yet much of what has been, and still is being, collected is protected by copyright, so its use may constitute copyright infringement. In the USA, AI developers rely on "fair use"; in Europe, the prevailing view is that the exception for "Text and Data Mining" (TDM) applies. In a recent interdisciplinary tandem study, we argued in detail that this is not the case, because generative AI training fundamentally differs from TDM. In this article, we share our main findings and their implications for both public and corporate research on generative models. We further discuss how the phenomenon of training data memorization leads to copyright issues independently of the "fair use" and TDM exceptions. Finally, we outline how the ISMIR community could contribute to the ongoing discussion about fair practices for generative AI that satisfy all stakeholders.

Paper Structure

This paper contains 28 sections.