Web Crawler Restrictions, AI Training Datasets & Political Biases

Paul Bouchaud, Pedro Ramaciotti

TL;DR

The paper addresses how growing AI crawler restrictions on the web affect the composition of training data for large language models. It combines CrUX-based site popularity, Cloudflare content categorization, robots.txt harvesting, and CommonCrawl derivatives to quantify restriction patterns and their temporal evolution, revealing substantial blocking, especially among high-visibility news and neutral outlets. The findings suggest that such heterogeneous restrictions can bias training data toward hyperpartisan or lower-quality content, with potential implications for model fairness, factuality, and political neutrality. The work calls for deliberate data-curation strategies and broader access controls beyond robots.txt to preserve dataset representativeness, while acknowledging limitations such as English-focused analyses and reliance on platform TOS and robots.txt directives.

Abstract

Large language models rely on web-scraped text for training; concurrently, content creators are increasingly blocking AI crawlers to retain control over their data. We analyze crawler restrictions across the top one million most-visited websites since 2023 and examine their potential downstream effects on training data composition. Our analysis reveals growing restrictions, with blocking patterns varying by website popularity and content type. A quarter of the top thousand websites restrict AI crawlers, decreasing to one-tenth across the broader top million. Content type matters significantly: 34.2% of news outlets disallow OpenAI's GPTBot, rising to 55% for outlets with high factual reporting. Additionally, outlets with neutral political positions impose the strongest restrictions (58%), whereas hyperpartisan websites and those with low factual reporting impose fewer restrictions: only 4.1% of right-leaning outlets block access to OpenAI. Our findings suggest that heterogeneous blocking patterns may skew training datasets toward low-quality or polarized content, potentially affecting the capabilities of models served by prominent AI-as-a-Service providers.
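
As an illustration of what "disallowing GPTBot" means at the robots.txt level, the following is a minimal sketch using Python's standard-library parser. It is not the authors' pipeline: the helper name `disallows_crawler` and the sample file `EXAMPLE_ROBOTS` are hypothetical, and the standard-library matching rules may differ in edge cases from what a given crawler actually implements.

```python
from urllib.robotparser import RobotFileParser


def disallows_crawler(robots_txt: str, user_agent: str, path: str = "/") -> bool:
    """Return True if the robots.txt text forbids `user_agent` from fetching `path`.

    Hypothetical helper for illustration; only the site root is checked by default.
    """
    parser = RobotFileParser()
    # parse() accepts an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(user_agent, path)


# Made-up robots.txt that blocks GPTBot site-wide and allows everyone else.
EXAMPLE_ROBOTS = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

if __name__ == "__main__":
    for agent in ("GPTBot", "Googlebot"):
        print(agent, "disallowed:", disallows_crawler(EXAMPLE_ROBOTS, agent))
```

Applying such a check to each site's robots.txt and averaging the boolean results per group (popularity bucket, content category, political leaning) would yield fractions of the kind reported in the abstract, under the assumption that a root-level disallow is the criterion of interest.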

Paper Structure

This paper contains 23 sections, 8 figures, 1 table.

Figures (8)

  • Figure 1: Distribution of content categories within the one million most visited websites worldwide (CrUX).
  • Figure 2: Fraction of webpages disallowing specific web crawlers on their domain over time, underlying data from CommonCrawl releases.
  • Figure 3: Fraction of webpages disallowing specific web crawlers on their domain, segmented by: (A) Content category per Cloudflare's classification, (B) CrUX popularity bucket. Error bars represent standard deviations over 100 bootstrap resamples (with replacement) over websites.
  • Figure 4: Fraction of websites disallowing OpenAI's GPTBot and ChatGPT-User, as a function of their political skew and factual reporting assessment by MBFC.
  • Figure 5: Fraction of websites disallowing GPTBot as a function of their audience ideological leaning, as characterized by Robertson et al. (2018). Error bars represent Clopper-Pearson 95% confidence intervals. The solid curve shows the fitted quadratic logistic regression with 95% confidence band (shaded region). The fraction disallowing GoogleBot is shown as a baseline.
  • ...and 3 more figures
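
The bootstrap error bars mentioned in the Figure 3 caption can be sketched as follows. This is an assumed reading of the caption (resample websites with replacement 100 times, recompute the blocking fraction each time, report the standard deviation); the function name `bootstrap_fraction_sd`, the seed, and the sample data are hypothetical and not taken from the paper.

```python
import random
import statistics


def bootstrap_fraction_sd(flags, n_resamples=100, seed=0):
    """Estimate the mean and standard deviation of a blocking fraction.

    `flags` holds one boolean per website (True = crawler disallowed).
    Each resample draws len(flags) websites with replacement.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(flags)
    fractions = []
    for _ in range(n_resamples):
        resample = rng.choices(flags, k=n)  # sampling with replacement
        fractions.append(sum(resample) / n)
    return statistics.mean(fractions), statistics.stdev(fractions)


if __name__ == "__main__":
    # Made-up group of 100 websites, 25 of which block the crawler.
    sample_flags = [True] * 25 + [False] * 75
    mean, sd = bootstrap_fraction_sd(sample_flags)
    print(f"fraction = {mean:.3f} +/- {sd:.3f}")
```

The standard deviation across resamples serves as the error bar for one group; smaller groups produce wider bars, which matters when comparing sparsely populated categories.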