A Survey of Web Content Control for Generative AI
Michael Dinzinger, Florian Heß, Michael Granitzer
TL;DR
The paper addresses the challenge that generative AI training increasingly relies on web data, raising copyright and data-protection concerns for publishers. It combines a legal analysis of intellectual property and data protection regimes in the EU and the US with a technical survey of web standards and ad hoc opt-out mechanisms, including the Robots Exclusion Protocol (REP), metadata, and newer protocols such as TDM Rep. An empirical study using Common Crawl data evaluates adoption of these approaches, highlighting gaps between policy and practice and the uneven uptake of most ad hoc solutions. The work argues for a pragmatic, interoperable path forward that aligns formal standards with publisher needs and AI developer practices, enhancing data sovereignty without stifling web indexing and access.
Abstract
The groundbreaking advancements in generative AI have recently caused a wave of concern, culminating in a series of lawsuits, including high-profile actions against Stability AI and OpenAI. This legal uncertainty has sparked a broad discussion on the rights of content creators and publishers to protect their intellectual property on the web. European and US law already provide rough guidelines, setting a direction for technical solutions to regulate web data use. Accordingly, researchers and practitioners have worked on numerous web standards and opt-out formats that empower publishers to keep their data out of the development of generative AI models. The emerging AI/ML opt-out protocols are valuable with regard to data sovereignty, but they also create a difficult situation for site owners, who are overwhelmed by the multitude of recent ad hoc standards to consider. In this work, we survey the different proposals, ideas, and initiatives, and provide a comprehensive legal and technical background in the context of the current discussion on web publishers' control.
