Extracting the U.S. building types from OpenStreetMap data
Henrique F. de Arruda, Sandro M. Reia, Shiyang Ruan, Kuldip S. Atwal, Hamdi Kavak, Taylor Anderson, Dieter Pfoser
TL;DR
The paper addresses the lack of nationwide building-type data by deriving a residential/non-residential classification for U.S. OpenStreetMap (OSM) building footprints using an unsupervised approach that leverages overlapping auxiliary OSM data. It produces a dataset of $67{,}705{,}475$ classified buildings organized by metropolitan, micropolitan, and other counties, with a workflow that propagates contextual tags to otherwise untagged footprints. Validation against authoritative ground truth for select counties shows high precision for non-residential classifications and high recall for residential classifications, and an analysis of the misclassifications attributes them mainly to missing or sparse OSM metadata. The resulting dataset and accompanying code can support population estimation, transportation planning, and urban analytics across the United States, even in regions with sparse metadata.
Abstract
Building type information is crucial for population estimation, traffic planning, urban planning, and emergency response applications. Although essential, such data is often not readily available. To alleviate this problem, this work creates a comprehensive dataset that provides a residential/non-residential building classification covering the entire United States. We propose and apply an unsupervised machine learning method that classifies building types based on building footprints and available OpenStreetMap (OSM) information. The classification result is validated using authoritative ground truth data for select counties in the U.S. The validation shows a high precision for non-residential building classification and a high recall for residential buildings. We identified several ways to improve the quality of the classification, such as removing sheds and garages from the dataset. Furthermore, an analysis of the misclassifications revealed that they are mainly due to missing or sparse metadata in OSM. A major result of this work is a dataset of 67,705,475 classified buildings. We hope that this data is of value to the scientific community, including urban and transportation planners.
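To make the validation metrics concrete, the following is a minimal sketch of how per-class precision and recall could be computed from ground-truth and predicted labels. The function name `precision_recall`, the label strings, and the toy data are hypothetical illustrations and are not taken from the paper's code or results.

```python
from typing import Sequence, Tuple


def precision_recall(
    truth: Sequence[str], predicted: Sequence[str], positive: str
) -> Tuple[float, float]:
    """Return (precision, recall) for the class named by `positive`."""
    pairs = list(zip(truth, predicted))
    # True positives: labeled `positive` in both ground truth and prediction.
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    # False positives: predicted `positive` but ground truth says otherwise.
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    # False negatives: ground truth is `positive` but prediction says otherwise.
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical toy labels (not data from the paper): one non-residential
# building is misclassified as residential.
truth = ["res", "res", "res", "res", "nonres", "nonres"]
predicted = ["res", "res", "res", "res", "res", "nonres"]

nonres_precision, nonres_recall = precision_recall(truth, predicted, "nonres")
res_precision, res_recall = precision_recall(truth, predicted, "res")
print(f"non-residential: precision={nonres_precision:.2f}, recall={nonres_recall:.2f}")
print(f"residential:     precision={res_precision:.2f}, recall={res_recall:.2f}")
```

In this toy case every building predicted as non-residential is correct (high non-residential precision) and every residential building is recovered (high residential recall), which is the same qualitative pattern the validation reports.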
