Somesite I Used To Crawl: Awareness, Agency and Efficacy in Protecting Content Creators From AI Crawlers

Enze Liu, Elisa Luo, Shawn Shan, Geoffrey M. Voelker, Ben Y. Zhao, Stefan Savage

TL;DR

This work examines whether content creators can effectively defend against AI crawlers using existing web-era controls. It combines a longitudinal robots.txt study, a targeted survey of artists (n=203), and large-scale site measurements to assess awareness, agency, and efficacy of protections, including active blocking and third-party services. The key findings are broad early adoption of robots.txt restrictions by well-resourced sites, substantial gaps in creator awareness and capability, and mixed efficacy, especially against AI assistant crawlers. Together these highlight both the promise and the limits of current mechanisms. The study points to a need for more accessible, verifiable, and comprehensive controls that align with creator needs and evolving legal frameworks, and it shows that neither robots.txt nor active blocking is a panacea on its own. Overall, the results argue for improved tooling, clearer signaling from crawlers, and coordinated policy development to better safeguard creators online.

Abstract

The success of generative AI relies heavily on training on data scraped through extensive crawling of the Internet, a practice that has raised significant copyright, privacy, and ethical concerns. While few measures are designed to resist a resource-rich adversary determined to scrape a site, crawlers can be impacted by a range of existing tools such as robots.txt, NoAI meta tags, and active crawler blocking by reverse proxies. In this work, we seek to understand the ability and efficacy of today's networking tools to protect content creators against AI-related crawling. For targeted populations like human artists, do they have the technical knowledge and agency to utilize crawler-blocking tools such as robots.txt, and can such tools be effective? Using large-scale measurements and a targeted user study of 203 professional artists, we find strong demand for tools like robots.txt, but their use is significantly constrained by critical hurdles in technical awareness, agency in deploying them, and limited efficacy against unresponsive crawlers. We further test and evaluate network-level crawler blockers provided by reverse proxies. Despite relatively limited deployment today, they offer stronger protections against AI crawlers, but still come with their own set of limitations.

Paper Structure

This paper contains 44 sections, 7 figures, 12 tables.

Figures (7)

  • Figure 1: In this example robots.txt file, Googlebot is allowed to crawl all URLs on the website, ChatGPT-User and GPTBot are disallowed from crawling any URLs, and all other crawlers are disallowed from crawling URLs under the /secret/ directory.
  • Figure 2: Percent of sites that fully disallow at least one AI crawler user agent for the Stable Top 5k (2,551 sites) and the remaining sites in the Stable Top 100k (37,904 sites).
  • Figure 3: Percent of Stable Top 100k sites that partially or fully disallow an AI crawler user agent in robots.txt over time. The vertical line indicates the release of the EU AI Act.
  • Figure 4: Number of sites that explicitly allow at least one AI crawler in their robots.txt over time, and number of sites that removed restrictions on AI crawlers in each time period. The vertical lines indicate public data deals between major publishers (who control 40+ domains) and OpenAI.
  • Figure 5: Squarespace provides a user-friendly option for controlling whether AI-related crawlers are disallowed in a site's robots.txt.
  • ...and 2 more figures
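
The policy described in the Figure 1 caption can be sketched as a concrete robots.txt file and checked with Python's standard `urllib.robotparser` module. This is a minimal illustration under stated assumptions, not the paper's actual figure: the file contents are reconstructed from the caption, and `example.com`, `ExampleBot`, and the helper names `build_parser` and `is_allowed` are hypothetical. The check only reports what a crawler that honors robots.txt would do; it says nothing about crawlers that ignore the file.

```python
from urllib.robotparser import RobotFileParser

# Reconstructed from the Figure 1 caption (an assumption, not the paper's file):
# Googlebot may crawl everything, ChatGPT-User and GPTBot may crawl nothing,
# and every other crawler is kept out of /secret/ only.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: ChatGPT-User
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /secret/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text held in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, path: str) -> bool:
    """Return True if a compliant crawler with this user agent may fetch path."""
    # example.com is a placeholder host; only the path matters for the rules.
    return parser.can_fetch(user_agent, "https://example.com" + path)


parser = build_parser(ROBOTS_TXT)

if __name__ == "__main__":
    # ExampleBot stands in for "all other crawlers" (hypothetical name).
    for agent in ("Googlebot", "GPTBot", "ChatGPT-User", "ExampleBot"):
        for path in ("/index.html", "/secret/page.html"):
            verdict = "allowed" if is_allowed(parser, agent, path) else "disallowed"
            print(f"{agent:13s} {path:18s} {verdict}")
```

Listing both `ChatGPT-User` and `GPTBot` above a single `Disallow: /` line places them in one rule group, which is how the caption's "disallowed from crawling any URLs" applies to both agents.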