Uncovering large inconsistencies between machine learning derived gridded settlement datasets

Vedran Sekara; Andrea Martini; Manuel Garcia-Herranz; Do-Hyung Kim

TL;DR

A global machine learning model is built to predict where datasets agree. Geographic and socio-economic factors considerably affect overlap, but there is great variability across countries, suggesting complex interactions between country morphology and dataset overlap.

Abstract

High-resolution human settlement maps provide detailed delineations of where people live and are vital for scientific and practical purposes, such as rapid disaster response, allocation of humanitarian resources, and international development. The increased availability of high-resolution satellite imagery, combined with powerful techniques from machine learning and artificial intelligence, has spurred the creation of a wealth of settlement datasets. However, the precise agreement and alignment between these datasets are not known. Here we quantify the overlap of high-resolution settlement maps for 42 African countries developed by Google (Open Buildings), Meta (High Resolution Population Maps) and GRID3 (Geo-Referenced Infrastructure and Demographic Data for Development). Across all studied countries we find large disagreement between datasets on how much area is considered settled. We demonstrate that there are considerable geographic and socio-economic factors at play and build a machine learning model to predict the areas in which datasets disagree. It is vital to understand the shortcomings of AI-derived high-resolution settlement layers, as international organizations, governments, and NGOs are already experimenting with incorporating these into programmatic work. As such, we anticipate our work to be a starting point for more critical and detailed analyses of AI-derived datasets for humanitarian, planning, policy, and scientific purposes.

Paper Structure (5 sections, 4 figures)

Introduction
Comparing settlement datasets
Understanding which factors contribute to mismatch
Discussion
Methods and Materials

Figures (4)

Figure 1: Illustration of the different settlement datasets. a, Satellite imagery of houses in the settlement Pindegumahun, located on the northern outskirts of the city of Bo in Sierra Leone. b, The Open Buildings (OB) dataset, developed by Google [sirko2021continental]. Settlement data is provided in csv files containing outlines of buildings (green outline), which are derived from high-resolution satellite imagery. The figure shows the actual inferred OB settlements. Some buildings are not detected, and there are slight offsets between settlements and the satellite image; this is because the images stem from two different periods (see Methods for more details). c, The Geo-Referenced Infrastructure and Demographic Data for Development (GRID3) dataset provides settlement data as vector files (i.e. polygons) [GRID3]. Polygons represent the extent of settled areas (red shaded area). d, The High Resolution Settlement Layer (HRSL), developed by Meta [tiecke2017mapping], is a raster dataset that provides settlement at a resolution of 1 arc-second, approximately 30$\times$30 m at the equator (see Methods), here illustrated by blue cells.
Figure 2: Comparison of the number of settled cells across data sources. a, There are large differences in how many settled cells the datasets contain per country, shown here for a small sample of countries. Because country areas differ, we normalized the number of settled cells by the surface area of each country. Countries are sorted according to cells/km$^2$ for GRID3. b, Comparison between the raw number of cells in GRID3 and HRSL shows that GRID3 consistently contains more settled cells (correlation $R=0.90; p\ll10^{-6}$). Each black dot signifies a country. c, GRID3 also consistently contains more settled cells than OB (correlation $R=0.99; p\ll10^{-6}$). d, The relationship between HRSL and OB is less clear (correlation $R=0.89; p\ll10^{-6}$). For a large majority of countries OB contains more cells, but there are also countries for which HRSL has identified more settled cells.
Figure 3: Large variations in overlap at the sub-national level between settlement datasets. a, Average overlap between datasets for 42 countries. High-resolution settlement datasets agree, on average, on only 42.6% of all cells. b, National-level statistics can hide a lot of nuance. Full lines denote overlap at the country level, while circles denote overlap at the regional level (admin 1). c, Spatial distribution of agreement between settlement datasets for administrative regions in 42 countries. Overlap is calculated based on $100\times100$ m settlement rasters, and summarized for each region.
Figure 4: Determining which factors influence overlap between high-resolution settlement datasets. Coefficient estimates of the model, excluding the intercept, with 95% confidence intervals estimated from 100 bootstrap samples. Confidence intervals are relatively narrow compared to the overall coefficient weights. (Fig. S12 in the SI shows the distribution of variance for the coefficients more clearly.) The grey shaded area indicates variables related to the settlement type.
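
The overlap statistic referred to in the Figure 3 caption can be illustrated with a minimal sketch. The captions do not spell out the exact definition, so the code below assumes that overlap is the share of cells marked as settled by every dataset among the cells marked as settled by at least one dataset, with all layers already resampled to a common grid. The function name `overlap_fraction` and the toy arrays are hypothetical and are not taken from the paper.

```python
import numpy as np


def overlap_fraction(*layers):
    """Share of cells settled in ALL layers among cells settled in ANY layer.

    Assumed definition of overlap; each layer is a binary array on the
    same grid (nonzero = settled).
    """
    stack = np.stack([np.asarray(layer, dtype=bool) for layer in layers])
    settled_any = stack.any(axis=0)  # settled in at least one dataset
    settled_all = stack.all(axis=0)  # settled in every dataset
    n_any = int(settled_any.sum())
    if n_any == 0:
        return float("nan")  # no settled cells at all: overlap undefined
    return float(settled_all.sum()) / n_any


# Toy 3x3 grids standing in for rasterized OB, GRID3 and HRSL layers.
ob = np.array([[1, 1, 0], [0, 1, 0], [0, 0, 0]])
grid3 = np.array([[1, 1, 1], [0, 1, 1], [0, 0, 0]])
hrsl = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])

toy_overlap = overlap_fraction(ob, grid3, hrsl)
print(f"overlap: {toy_overlap:.3f}")
```

Under this assumed definition, a regional summary would apply the same function to the cells falling inside each administrative region.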
