Quantifying Geospatial in the Common Crawl Corpus
Ilya Ilyankou, Meihui Wang, Stefano Cavazzi, James Haworth
TL;DR
This work quantifies geospatial content in the Common Crawl (CC) corpus to understand how much coordinate and address data LLM pre-training is exposed to. It uses Gemini 1.5 in a needle-in-a-haystack setup across three CC releases, with Cochran-based sampling, to obtain a prevalence estimate of $18.7\% \pm 0.5\%$. The findings show a substantial geospatial presence (addresses 16.1%, coordinates 7.0%, both 4.3%), with similar rates across languages and Google Maps links accounting for a dominant share of coordinates. The results imply that CC substantially informs LLM geospatial capabilities and biases. They also highlight CC as a potential resource for building geospatial datasets, while underscoring the need to assess quality and bias across regions and languages.
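To make the "Cochran-based sampling" step concrete, here is a minimal sketch of Cochran's sample-size formula for estimating a proportion, $n_0 = z^2 p(1-p)/e^2$, with the optional finite-population correction. The function names, the 95% confidence level ($z = 1.96$), and the example inputs are assumptions chosen for illustration; the summary above does not state the confidence level or the sample size that was actually used.

```python
import math


def cochran_sample_size(p, margin, z=1.96, population=None):
    """Cochran's sample size for estimating a proportion.

    p          -- anticipated proportion (0.5 is the most conservative choice)
    margin     -- desired margin of error, as a fraction (0.05 means +/- 5%)
    z          -- z-score for the confidence level (1.96 assumes 95%)
    population -- optional finite population size for the correction term
    """
    n0 = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite-population correction; negligible for a web-scale corpus.
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


def margin_of_error(p, n, z=1.96):
    """Margin of error of a proportion p estimated from n sampled documents."""
    return z * math.sqrt(p * (1 - p) / n)


if __name__ == "__main__":
    # Textbook case: unknown proportion, +/- 5% at 95% confidence.
    print(cochran_sample_size(0.5, 0.05))
    # The same target for a hypothetical population of 1,000 documents.
    print(cochran_sample_size(0.5, 0.05, population=1000))
```

With these illustrative inputs the two calls give 385 and 278 documents respectively. Because a crawl release contains a very large number of documents, the correction term has almost no effect in that setting and the uncorrected $n_0$ is the relevant quantity.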
Abstract
Large language models (LLMs) exhibit emerging geospatial capabilities, stemming from their pre-training on vast unlabelled text datasets that are often derived from the Common Crawl (CC) corpus. However, the geospatial content within CC remains largely unexplored, which limits our understanding of LLMs' spatial reasoning. This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini 1.5, a powerful language model. By analyzing a sample of documents and manually revising the results, we estimate that 18.7% of web documents in CC contain geospatial information such as coordinates and addresses. We find little difference in prevalence between English- and non-English-language documents. Our findings provide quantitative insights into the nature and extent of geospatial data in CC, and lay the groundwork for future studies of geospatial biases of LLMs.
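As a quick consistency check on the figures reported in the TL;DR, the overall prevalence should follow from the per-category rates by inclusion-exclusion. This is a sketch under one assumption: that the 16.1% and 7.0% figures each include the 4.3% of documents containing both an address and a coordinate. The symbols $A$ (document contains an address) and $C$ (document contains coordinates) are introduced here for illustration only.

$$
P(A \cup C) = P(A) + P(C) - P(A \cap C) \approx 16.1\% + 7.0\% - 4.3\% = 18.8\%
$$

Under that reading, the result is within 0.1 percentage points of the headline estimate of 18.7%, a gap that is consistent with each component having been rounded to one decimal place.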
