Copyright related risks in the creation and use of ML/AI systems
Daniel M. German
TL;DR
This paper analyzes copyright-related risks in the creation and use of ML/AI systems, focusing on training data ownership, model outputs, and user-generated works. It reviews how different jurisdictions treat the use of training data, derivative works, and the copyrightability of AI outputs, drawing on cases such as Andersen v. Stability AI, Getty Images v. Stability AI, and Doe v. GitHub, as well as US Copyright Office guidelines. It discusses policy developments at WIPO and ongoing litigation that shape risk exposure for data owners, model operators, and users. It then proposes mitigations, including licensing strategies, contract terms, and proactive governance, to balance data providers' rights with the benefits of AI, while acknowledging that the legal landscape remains unsettled.
Abstract
This paper summarizes the current copyright-related risks that Machine Learning (ML) and Artificial Intelligence (AI) systems, including Large Language Models (LLMs), incur. These risks affect different stakeholders: the owners of the copyright in the training data, the users of ML/AI systems, the creators of trained models, and the operators of AI systems. This paper also provides an overview of ongoing legal cases in the United States related to these risks.
