Quantifying Geospatial in the Common Crawl Corpus

Ilya Ilyankou, Meihui Wang, Stefano Cavazzi, James Haworth

TL;DR

This work quantifies geospatial content in the Common Crawl corpus to understand the exposure of LLM pre-training to coordinates and addresses. It uses Gemini 1.5 in a needle-in-a-haystack setup across three CC releases, with Cochran-based sampling to obtain a precise prevalence estimate of $18.7\% \pm 0.5\%$. The findings show substantial geospatial presence (addresses 16.1%, coordinates 7.0%, both 4.3%), with similar rates across languages and a dominant role for Google Maps links in coordinates. The results imply CC substantially informs LLM geospatial capabilities and biases, and highlight CC as a potential resource for geospatial datasets while underscoring the need for quality and bias assessment across regions and languages.
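The link between Cochran-based sampling and a $\pm 0.5\%$ margin can be illustrated with a short calculation. The sketch below is not the authors' code. It assumes the standard infinite-population form of Cochran's formula, $n_0 = z^2 p(1-p)/e^2$, a worst-case proportion $p = 0.5$, and $z = 1.96$ for 95% confidence. The function names are hypothetical, and the sample size the paper actually used is not stated in this summary.

```python
import math


def cochran_sample_size(margin, p=0.5, z=1.96):
    """Sample size from Cochran's formula for a very large population.

    margin: desired half-width of the confidence interval (0.005 = 0.5%).
    p: assumed proportion; 0.5 is the worst case (an assumption here).
    z: normal critical value; 1.96 corresponds to 95% confidence.
    """
    n0 = z ** 2 * p * (1 - p) / margin ** 2
    # Subtract a tiny tolerance so floating-point round-off cannot push
    # an exact integer result up to the next integer.
    return math.ceil(n0 - 1e-9)


def margin_of_error(n, p, z=1.96):
    """Half-width of the normal-approximation interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)


n0 = cochran_sample_size(margin=0.005)
print(n0)  # 38416 under the assumptions above

# Margin implied by that sample size at the reported prevalence of 18.7%.
print(margin_of_error(n0, p=0.187))
```

Because $p(1-p)$ peaks at $p = 0.5$, a sample sized for the worst case gives a margin tighter than $\pm 0.5\%$ once the observed prevalence is well below 50%, as it is here.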

Abstract

Large language models (LLMs) exhibit emerging geospatial capabilities, stemming from their pre-training on vast unlabelled text datasets that are often derived from the Common Crawl (CC) corpus. However, the geospatial content within CC remains largely unexplored, impacting our understanding of LLMs' spatial reasoning. This paper investigates the prevalence of geospatial data in recent Common Crawl releases using Gemini 1.5, a powerful language model. By analyzing a sample of documents and manually revising the results, we estimate that 18.7% of web documents in CC contain geospatial information such as coordinates and addresses. We find little difference in prevalence between English- and non-English-language documents. Our findings provide quantitative insights into the nature and extent of geospatial data in CC, and lay the groundwork for future studies of geospatial biases of LLMs.

Paper Structure (12 sections, 1 equation, 4 figures, 2 tables)

Figures (4)

  • Figure 1: The distribution of sampled web documents' length, in tokens. We capped the x-axis at 200,000; some documents reach 1M tokens in length.
  • Figure 2: Language frequency among sampled documents vs full releases for the 20 most common languages excluding English (which, at 45%, dwarfs all other languages). UNK is for unknown languages.
  • Figure 3: The prevalence of geospatial data in select CC releases, estimated within $\pm0.5\%$ at $95\%$ confidence.
  • Figure 4: Geospatial data frequency in English and non-English documents. Eng + Oth refers to documents that are both in English and some other language.