Subnational Geocoding of Global Disasters Using Large Language Models
Michele Ronco, Damien Delforge, Wiebke S. Jäger, Christina Corbane
TL;DR
This work addresses the unstructured subnational location data in EM-DAT by introducing LLM-GeoDis, a fully automated workflow that uses GPT-4o to parse free-text locations into hierarchical Admin1–Admin3 representations and then geocodes them against three independent sources (GADM, OpenStreetMap, Wikidata). The workflow reconciles candidate geometries, assigns a cross-source reliability score, and reprojects results to a unified GADM framework, producing a global, openly available geocoded disaster dataset for 2000–2024. The approach is evaluated against manual EM-DAT geocodes and the GDIS dataset, showing high cross-source concordance and broad regional coverage across nine disaster subgroups and 31 disaster types, with Admin1 the dominant level of spatial detail. By enabling scalable, reproducible subnational analyses and improving interoperability with hazard and population layers, the method supports finer-grained risk modeling and policy monitoring under global frameworks. The authors acknowledge data-quality limitations and the value of human-in-the-loop refinement for high-stakes cases.
Abstract
Subnational location data of disaster events are critical for risk assessment and disaster risk reduction. Disaster databases such as EM-DAT often report locations as unstructured text with inconsistent granularity or spelling, which makes them difficult to integrate with spatial datasets. We present a fully automated LLM-assisted workflow that processes and cleans textual location information using GPT-4o and assigns geometries by cross-checking three independent geoinformation repositories: GADM, OpenStreetMap, and Wikidata. Based on the agreement and availability of these sources, we assign a reliability score to each location while generating subnational geometries. Applied to the EM-DAT dataset from 2000 to 2024, the workflow geocodes 14,215 events across 17,948 unique locations. Unlike previous methods, our approach requires no manual intervention, covers all disaster types, enables cross-verification across multiple sources, and allows flexible remapping to preferred frameworks. Beyond the dataset, we demonstrate the potential of LLMs to extract and structure geographic information from unstructured text, offering a scalable and reliable method for related analyses.
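
As an illustration of how a reliability score might be derived from the agreement and availability of the three sources, the following is a minimal sketch and not the scoring rule used in the paper. It assumes each source returns either nothing or a bounding box for a location, and it measures agreement by intersection-over-union. The function names, the 0.5 threshold, and the 0 to 3 scale are all hypothetical.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple

# Hypothetical geometry stand-in: (min_lon, min_lat, max_lon, max_lat).
BBox = Tuple[float, float, float, float]


def bbox_iou(a: BBox, b: BBox) -> float:
    """Intersection-over-union of two axis-aligned bounding boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def reliability_score(
    candidates: Dict[str, Optional[BBox]], iou_threshold: float = 0.5
) -> int:
    """Illustrative score (an assumption, not the paper's definition):
    0 = no source returned a geometry,
    1 = at least one geometry, but no two sources agree,
    2 = at least one pair of sources agrees,
    3 = three sources available and every pair agrees.
    """
    available = [box for box in candidates.values() if box is not None]
    if not available:
        return 0
    agreeing_pairs = sum(
        1
        for a, b in combinations(available, 2)
        if bbox_iou(a, b) >= iou_threshold
    )
    if agreeing_pairs == 0:
        return 1
    if len(available) == 3 and agreeing_pairs == 3:
        return 3
    return 2


# Example: two sources found the location and largely overlap.
example = {
    "gadm": (0.0, 0.0, 10.0, 10.0),
    "osm": (0.0, 0.0, 10.0, 9.0),
    "wikidata": None,
}
print(reliability_score(example))
```

A real implementation would compare polygon geometries rather than bounding boxes and would likely weight sources by administrative level, but the structure (check availability first, then pairwise agreement) follows the description above.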
