Benchmarking Large Language Models for Geolocating Colonial Virginia Land Grants

Ryan Mioduski

TL;DR

This work addresses the challenge of geolocating seventeenth- and eighteenth-century Virginia land patents that survive only as narrative metes-and-bounds descriptions. It benchmarks six OpenAI models under direct-to-coordinate and tool-augmented prompting against four baselines (a GIS analyst, Stanford NER, Mordecai-3, and county centroids), using a 5,471-abstract corpus and 43 gold-standard coordinates. Key contributions include a publicly released machine-readable corpus, verified ground-truth coordinates, an evaluation framework, and a cost–accuracy analysis showing that LLMs can perform macro-scale historical georeferencing with substantial time and cost savings. The best single-call model achieves ~23 km mean error, while a five-call ensemble reduces this to ~19.2 km at marginal extra cost, tracing a practical Pareto frontier between accuracy and cost. Overall, the results support a scalable, transparent, and economically viable approach to digitizing historical archives for GIS-enabled inquiry.

Abstract

Virginia's seventeenth- and eighteenth-century land patents survive primarily as narrative metes-and-bounds descriptions, which limits spatial analysis. This study systematically evaluates how well current-generation large language models (LLMs) convert these prose abstracts into geographically accurate latitude/longitude coordinates within a focused evaluation context. A digitized corpus of 5,471 Virginia patent abstracts (1695–1732) is released, with 43 rigorously verified test cases serving as an initial, geographically focused benchmark. Six OpenAI models across three architectures (o-series, GPT-4-class, and GPT-3.5) were tested under two paradigms: direct-to-coordinate prompting and tool-augmented chain-of-thought prompting that invokes external geocoding APIs. Results were compared against a GIS analyst baseline, the Stanford NER geoparser, the Mordecai-3 neural geoparser, and a county-centroid heuristic. The top single-call model, o3-2025-04-16, achieved a mean error of 23 km (median 14 km), outperforming the median LLM (37.4 km) by 37.5%, the weakest LLM (50.3 km) by 53.5%, and the external baselines by 67% (GIS analyst) and 70% (Stanford NER). A five-call ensemble further reduced the mean error to 19.2 km (median 12.2 km) at minimal additional cost (~USD 0.20 per grant), outperforming the median LLM by 48.7%. A patentee-name redaction ablation increased error only slightly (~7%), indicating that the models rely on textual landmark and adjacency descriptions rather than memorization. The cost-effective gpt-4o-2024-08-06 model maintained a 28 km mean error at USD 1.09 per 1,000 grants, establishing a strong cost-accuracy benchmark. External geocoding tools offered no measurable benefit in this evaluation. These findings demonstrate LLMs' potential for scalable, accurate, cost-effective historical georeferencing.
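
The error figures above are distances in kilometres between a predicted point and a ground-truth point, and the percentage comparisons are relative reductions in mean error. The following is a minimal sketch of both calculations, not the paper's own code: the abstract does not state which distance formula was used, so the haversine great-circle distance and the Earth-radius constant are assumptions, and the function names are hypothetical.

```python
import math

# Assumption: mean Earth radius in km; the abstract does not specify one.
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def relative_reduction(baseline_km, model_km):
    """Percent by which model_km improves on baseline_km."""
    return 100.0 * (baseline_km - model_km) / baseline_km


# Ensemble (19.2 km) versus median LLM (37.4 km), both taken from the abstract.
print(round(relative_reduction(37.4, 19.2), 1))  # 48.7
```

The last line reproduces the 48.7% figure reported for the five-call ensemble; the single-call percentages appear to be computed from unrounded mean errors, so they will not match exactly if recomputed from the rounded 23 km value.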

Paper Structure

This paper contains 61 sections, 1 equation, 8 figures, 1 table.

Figures (8)

  • Figure 1: Mean geolocation error by method with 95% CIs on 43 grants; enables direct accuracy comparison across LLMs and baselines.
  • Figure 2: Error distributions by method; shows spread, skew, and outliers for LLMs versus baselines.
  • Figure 3: Error distributions for LLMs vs. GIS analyst only, isolating core methods from heuristic baselines.
  • Figure 4: Cumulative accuracy vs. distance threshold (km); curves nearer the upper left indicate better performance across thresholds.
  • Figure 5: Cost–accuracy Pareto frontier: mean error (km) versus cost per 1,000 located grants (USD). Points nearer the lower-left frontier represent better cost–accuracy trade-offs.
  • ...and 3 more figures
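
The quantities plotted in Figures 4 and 5 can be illustrated with a short sketch, assuming a cumulative-accuracy curve is the fraction of grants whose error falls at or below each distance threshold, and that a method is on the Pareto frontier when no other method is both cheaper (or equal in cost) and more accurate. The function names and the toy inputs are hypothetical and are not results from the paper.

```python
def cumulative_accuracy(errors_km, thresholds_km):
    """Fraction of predictions with error <= each threshold (one value per threshold)."""
    n = len(errors_km)
    return [sum(e <= t for e in errors_km) / n for t in thresholds_km]


def pareto_frontier(points):
    """Return names of non-dominated methods, ordered by increasing cost.

    points: iterable of (name, cost_usd_per_1000, mean_error_km).
    Lower is better on both axes.
    """
    frontier = []
    best_error = float("inf")
    # Sweep from cheapest to most expensive; keep a point only if it
    # improves on the best error seen at any lower-or-equal cost.
    for name, cost, err in sorted(points, key=lambda p: (p[1], p[2])):
        if err < best_error:
            frontier.append(name)
            best_error = err
    return frontier


# Toy inputs for illustration only (not values from the paper).
toy_errors = [5.0, 12.0, 30.0, 80.0]
print(cumulative_accuracy(toy_errors, [10, 25, 50]))  # [0.25, 0.5, 0.75]

toy_methods = [("a", 1.0, 30.0), ("b", 2.0, 35.0), ("c", 5.0, 20.0), ("d", 5.0, 25.0)]
print(pareto_frontier(toy_methods))  # ['a', 'c']
```

In the toy data, method "b" is dropped because "a" is both cheaper and more accurate, and "d" is dropped because "c" has the same cost with lower error.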