Subnational Geocoding of Global Disasters Using Large Language Models

Michele Ronco, Damien Delforge, Wiebke S. Jäger, Christina Corbane

TL;DR

This work tackles the problem of unstructured subnational location data in EM-DAT by introducing LLM-GeoDis, a fully automated workflow that uses GPT-4o to parse free-text locations into hierarchical Admin1–Admin3 representations and then geocodes them via three independent sources (GADM, OpenStreetMap, Wikidata). It reconciles candidate geometries, assigns a cross-source reliability score, and reprojects results to a unified GADM framework, producing a global, openly available geocoded disaster dataset for 2000–2024. The approach is evaluated against manual EM-DAT geocodes and the GDIS dataset, showing high cross-source concordance and broad regional coverage across nine disaster subgroups and 31 disaster types, with Admin1 the most common level of spatial detail. By enabling scalable, reproducible subnational analyses and improving interoperability with hazard and population layers, the method supports finer-grained risk modeling and policy monitoring under global frameworks. The authors acknowledge data-quality limitations and the value of human-in-the-loop refinement for high-stakes cases.

Abstract

Subnational location data of disaster events are critical for risk assessment and disaster risk reduction. Disaster databases such as EM-DAT often report locations as unstructured text with inconsistent granularity or spelling, which makes them difficult to integrate with spatial datasets. We present a fully automated LLM-assisted workflow that processes and cleans textual location information using GPT-4o, and assigns geometries by cross-checking three independent geoinformation repositories: GADM, OpenStreetMap and Wikidata. Based on the agreement and availability of these sources, we assign a reliability score to each location while generating subnational geometries. Applied to the EM-DAT dataset from 2000 to 2024, the workflow geocodes 14,215 events across 17,948 unique locations. Unlike previous methods, our approach requires no manual intervention, covers all disaster types, enables cross-verification across multiple sources, and allows flexible remapping to preferred frameworks. Beyond the dataset, we demonstrate the potential of LLMs to extract and structure geographic information from unstructured text, offering a scalable and reliable method for related analyses.
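To make the idea of a reliability score based on "agreement and availability" concrete, here is a minimal sketch. It is not the scoring rule used in the paper, which this summary does not specify. The function names, the centroid-based agreement test, and the 0.5-degree tolerance are all illustrative assumptions: the score is simply the size of the largest group of sources whose candidate centroids mutually agree.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple

Point = Tuple[float, float]  # (latitude, longitude) of a candidate geometry's centroid


def agree(a: Point, b: Point, tol_deg: float = 0.5) -> bool:
    """Hypothetical agreement test: centroids within tol_deg on both axes."""
    return abs(a[0] - b[0]) <= tol_deg and abs(a[1] - b[1]) <= tol_deg


def reliability_score(candidates: Dict[str, Optional[Point]],
                      tol_deg: float = 0.5) -> int:
    """Illustrative score (an assumption, not the paper's definition).

    0 = no source returned a geometry,
    1 = at least one source returned a geometry, but no two agree,
    n = the largest group of n sources that agree pairwise.
    """
    found = [p for p in candidates.values() if p is not None]
    # Try the largest group first, then smaller ones, down to pairs.
    for size in range(len(found), 1, -1):
        for group in combinations(found, size):
            if all(agree(a, b, tol_deg) for a, b in combinations(group, 2)):
                return size
    return 1 if found else 0


# Example: GADM and OSM agree, Wikidata returned nothing.
example = {"GADM": (45.0, 9.0), "OSM": (45.1, 9.1), "Wikidata": None}
score = reliability_score(example)
```

A real implementation would compare polygon geometries rather than centroids; the point here is only that availability (how many sources respond) and agreement (how many of them coincide) can be combined into a single ordinal score.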

Paper Structure

This paper contains 17 sections, 13 figures, 1 table.

Figures (13)

  • Figure 1: Overview of the geocoding pipeline. Free-text location strings from EM-DAT are first parsed by GPT-4o, which outputs a structured JSON representation of the identified hierarchical administrative units. This intermediate representation is then processed independently through three geocoding procedures—GADM, OpenStreetMap, and Wikidata—to obtain candidate geometries. The resulting geometries are cross-checked across sources to establish consistency and confidence in the final spatial representation of each disaster location.
  • Figure 2: GPT-4o prompt used for parsing disaster location strings with in-context learning (ICL). The prompt provides a sample input and the expected JSON output to guide the model in identifying unique locations, resolving granularity, cleaning names, and organizing them hierarchically into Admin1, Admin2, and Admin3 levels. The name of the country is provided to reduce toponymic ambiguities. This prompt ensures that GPT-4o produces a structured canonical representation suitable for subsequent geocoding.
  • Figure 3: Geocoding coverage and resolution of disaster records (2000–2024). (a) Heatmap showing the proportion of events geocoded by disaster subgroup. (b) Stacked bar chart of geocoded records every five years, illustrating temporal trends. (c) Regional distribution of geocoded disasters, showing relative contributions across continents.
  • Figure 4: Distribution of reported Disaster Events (2000–2024). This map displays disaster occurrences aggregated at the GADM Admin1 level, providing a visual overview of event frequency across regions. The original data, available at a finer resolution, has been remapped to Admin1 for consistency and clarity in global analysis.
  • Figure 5: Area overlap of candidate geometries from datasets derived with (semi-)automatic geocoding (GDIS, LLM-GADM, LLM-OSM, and LLM-Wiki) with benchmark geometries (EM-DAT GAUL). The panels show histograms of four overlap metrics, each expressed as a percentage: (a) the proportion of each single-location administrative candidate area included in the benchmark disaster area, where LLM-Wiki data with Point geometries are reported as a binary variable (0 for non-inclusion, 1 for inclusion); (b) the proportion of the candidate disaster area (i.e., dissolved single-location geometries sharing the same DisNo.) included in the benchmark disaster area; (c) the proportion of the benchmark disaster area included in the candidate disaster area; and (d) the Jaccard index between the candidate disaster areas and the benchmark disaster areas.
  • ...and 8 more figures
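
The overlap metrics described in the Figure 5 caption (candidate area included in the benchmark, benchmark area included in the candidate, and the Jaccard index) can be illustrated with a short sketch. As a simplifying assumption, each disaster area is represented here as a set of equal-area grid cells rather than as polygon geometries, and the function name `overlap_metrics` is hypothetical; the paper's own computation is not reproduced.

```python
from typing import Dict, Set, Tuple

Cell = Tuple[int, int]  # index of an equal-area grid cell (a stand-in for polygon area)


def overlap_metrics(candidate: Set[Cell], benchmark: Set[Cell]) -> Dict[str, float]:
    """Area-overlap metrics between a candidate and a benchmark disaster area.

    Areas are approximated by counting shared grid cells (an assumption made
    for illustration; real geometries would use polygon intersection areas).
    """
    inter = len(candidate & benchmark)
    union = len(candidate | benchmark)
    return {
        # Share of the candidate area that lies inside the benchmark area.
        "candidate_in_benchmark": inter / len(candidate) if candidate else 0.0,
        # Share of the benchmark area that is covered by the candidate area.
        "benchmark_in_candidate": inter / len(benchmark) if benchmark else 0.0,
        # Jaccard index: intersection area over union area.
        "jaccard": inter / union if union else 0.0,
    }


# Two 2x2 blocks of cells that share one column of two cells.
candidate_area = {(0, 0), (0, 1), (1, 0), (1, 1)}
benchmark_area = {(1, 0), (1, 1), (2, 0), (2, 1)}
metrics = overlap_metrics(candidate_area, benchmark_area)
```

In this example half of each area lies inside the other, while the Jaccard index is lower (2 shared cells out of 6 in the union) because it penalizes mismatch on both sides at once.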