Scrapers selectively respect robots.txt directives: evidence from a large-scale empirical study

Taein Kim, Karstan Bock, Claire Luo, Amanda Liswood, Chloe Poroslay, Emily Wenger

TL;DR

This study provides the first large-scale, controlled evaluation of robots.txt compliance across diverse bots and sites, revealing that compliance wanes as directives become stricter and that AI-related bots often do not check robots.txt consistently. Using both active (four staged robots.txt deployments) and passive analyses on anonymized logs from 36 sites over 40 days, the authors show that crawl-delay directives are most effective while disallow rules are least respected, with SEO crawlers most compliant and AI agents mid-range. The work also uncovers substantial variability across individual bots and evidence of user-agent spoofing, which can obscure true compliance patterns. Collectively, these findings challenge the reliability of robots.txt as a sole deterrent against scraping and motivate the search for more enforceable or robust defense mechanisms.

Abstract

Online data scraping has taken on new dimensions in recent years, as traditional scrapers have been joined by new AI-specific bots. To counteract unwanted scraping, many sites use tools like the Robots Exclusion Protocol (REP), which places a robots.txt file at the site root to dictate scraper behavior. Yet, the efficacy of the REP is not well understood. Anecdotal evidence suggests some bots comply poorly with it, but no rigorous study exists to support (or refute) this claim. To understand the merits and limits of the REP, we conduct the first large-scale study of web scraper compliance with robots.txt directives using anonymized web logs from our institution. We analyze the behavior of 130 self-declared bots (and many anonymous ones) over 40 days, using a series of controlled robots.txt experiments. We find that bots are less likely to comply with stricter robots.txt directives, and that certain categories of bots, including AI search crawlers, rarely check robots.txt at all. These findings suggest that relying on robots.txt files to prevent unwanted scraping is risky and highlight the need for alternative approaches.

Paper Structure

This paper contains 20 sections, 1 equation, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Example robots.txt file. This site allows bots with user-agent Googlebot to access all paths with a crawl-delay of 15 seconds. All other bots are given a crawl-delay of 30 seconds and can only access data under the /allowed-data/ path.
  • Figure 2: Traditional and AI search, as well as AI data scrapers, are the most active bot types in our dataset. Headless browsers (browsers running without a GUI, commonly used by scrapers) are fourth.
  • Figure 3: Most bots in the top 5 categories by data scraped collect data steadily, but search engine crawlers buck the trend, driven by YisouSpider's mid-March activity. AI assistants scrape a large amount of data relative to their session count (Fig. \ref{fig:hist_bots}).
  • Figure 4: Traditional and AI search crawlers exhibit the most volatility in scraping patterns during the date range encompassed by our dataset. The volatility in these two categories corresponds directly with high scraping activity of YisouSpider (a search engine crawler) and AppleBot (an AI search crawler). We plot the behaviors of the top 5 categories of bots by session count for simplicity.
  • Figure 5: Original robots.txt file. Sitename and host fields are in all robots.txt versions but were removed for anonymous submission.
  • ...and 6 more figures
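
The policy described in the Figure 1 caption can be made concrete with a small sketch. The robots.txt text below is a reconstruction from the caption's wording, not the paper's actual figure, and the host `https://example.com`, the bot name `ExampleBot`, and the page paths are placeholders chosen for illustration. The sketch uses Python's standard `urllib.robotparser` to show what a fully compliant bot would be permitted to do under such a file; it says nothing about what real bots actually do, which is the question the study measures.

```python
from urllib.robotparser import RobotFileParser

# Assumed reconstruction of the policy in the Figure 1 caption
# (not copied from the paper): Googlebot may fetch everything with a
# 15 s crawl-delay; every other bot gets a 30 s crawl-delay and may
# only fetch paths under /allowed-data/.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /
Crawl-delay: 15

User-agent: *
Allow: /allowed-data/
Disallow: /
Crawl-delay: 30
"""


def load_rules(text: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser without any network access."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rules = load_rules(ROBOTS_TXT)
BASE = "https://example.com"  # placeholder host, not from the paper

# "ExampleBot" is a hypothetical user agent that falls under the "*" group.
for agent in ("Googlebot", "ExampleBot"):
    for path in ("/allowed-data/page.html", "/private/page.html"):
        verdict = "allowed" if rules.can_fetch(agent, BASE + path) else "disallowed"
        print(f"{agent:10s} {path:26s} {verdict}")
    print(f"{agent:10s} crawl-delay: {rules.crawl_delay(agent)} s")
```

In the catch-all group, `Allow: /allowed-data/` is listed before `Disallow: /` because Python's parser applies the first matching rule in file order; other parsers may resolve overlapping rules differently. The parser only reports what the file asks for, so honoring the answer remains voluntary on the bot's side.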