Extracting the U.S. building types from OpenStreetMap data

Henrique F. de Arruda, Sandro M. Reia, Shiyang Ruan, Kuldip S. Atwal, Hamdi Kavak, Taylor Anderson, Dieter Pfoser

TL;DR

The paper tackles the lack of nationwide building-type data by deriving a residential/non-residential classification for U.S. OpenStreetMap (OSM) footprints using an unsupervised approach that leverages overlapping auxiliary OSM data. It produces a dataset of 67,705,475 buildings organized by metropolitan, micropolitan, and other counties, with a workflow that propagates contextual tags to otherwise untagged footprints. Validation against authoritative ground truth for select counties shows high precision for non-residential classifications and high recall for residential classifications, and an analysis of the misclassifications identifies metadata gaps in OSM. The resulting dataset and accompanying code support population estimation, transportation planning, and urban analytics across the United States, even in regions with sparse metadata.

Abstract

Building type information is crucial for population estimation, traffic planning, urban planning, and emergency response applications. Although essential, such data is often not readily available. To alleviate this problem, this work creates a comprehensive dataset by providing residential/non-residential building classification covering the entire United States. We propose and utilize an unsupervised machine learning method to classify building types based on building footprints and available OpenStreetMap information. The classification result is validated using authoritative ground truth data for select counties in the U.S. The validation shows a high precision for non-residential building classification and a high recall for residential buildings. We identified various approaches to improving the quality of the classification, such as removing sheds and garages from the dataset. Furthermore, analyzing the misclassifications revealed that they are mainly due to missing and scarce metadata in OSM. A major result of this work is a dataset classifying 67,705,475 buildings. We hope that this data is of value to the scientific community, including urban and transportation planners.

Paper Structure (14 sections, 9 figures, 4 tables, 1 algorithm)

Figures (9)

  • Figure 1: Average proportion of buildings without annotations. We considered both the tags associated with the buildings and the OSM geospatial features that overlap with the building footprints, such as "landuse" and "amenity". The states with the lowest fractions of untagged buildings are Rhode Island (0.04), Florida (0.11), Wisconsin (0.24), Washington D.C. (0.24), and Wyoming (0.34). The states with the highest proportions of untagged buildings are Massachusetts (0.92), Connecticut (0.85), North Dakota (0.76), New York (0.69), and West Virginia (0.64).
  • Figure 2: Scheme of the building classification methodology. The OSM data is obtained in two separate data types: the building footprints with their respective tag values and additional auxiliary data (e.g., land use and amenities). First, the buildings are classified with tags indicating residential and non-residential use. Next, the unknown buildings are classified using the additional auxiliary data that overlaps with the building footprints. Finally, the remaining unknown buildings are classified as residential.
  • Figure 3: Illustrations of the identified buildings. Panel (a) shows Hanover, VA, and panel (b) shows Fairfax, VA. The mixed-use and unknown building footprints are not shown.
  • Figure 4: Stacked bar chart showing the fractions of misclassified buildings. The colors represent buildings misclassified due to the absence of tags (no tags), the presence of a residential tag (wrong res. tag), and incorrect residential auxiliary data (wrong res. auxiliary).
  • Figure 5: Zoomed-in view of a region of Fairfax. This region is taken from the Fairfax panel of Figure 3, where we found buildings misclassified as residential.
  • ...and 4 more figures
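
The three-step scheme described in the Figure 2 caption (classify by the building's own tags, then by overlapping auxiliary data, then default to residential) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tag vocabularies, the `Building` structure, and the rule that conflicting evidence yields a "mixed" label are all assumptions made for illustration, and the spatial overlap test is assumed to have been done beforehand.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical tag vocabularies (assumption: the paper's actual lists differ).
RESIDENTIAL_VALUES = {"house", "apartments", "residential", "detached"}
NON_RESIDENTIAL_VALUES = {"commercial", "retail", "school", "industrial", "hospital"}


@dataclass
class Building:
    # Value of the footprint's own OSM "building" key (e.g. "house" or "yes").
    building_tag: Optional[str] = None
    # Values of auxiliary features (e.g. "landuse", "amenity") that overlap the
    # footprint; the overlap computation itself is assumed to be done elsewhere.
    overlapping: List[str] = field(default_factory=list)


def label_from_values(values: List[str]) -> Optional[str]:
    """Map tag values to a label, or None if they carry no usable evidence."""
    is_res = any(v in RESIDENTIAL_VALUES for v in values)
    is_non = any(v in NON_RESIDENTIAL_VALUES for v in values)
    if is_res and is_non:
        return "mixed"  # assumption: conflicting evidence is treated as mixed use
    if is_res:
        return "residential"
    if is_non:
        return "non-residential"
    return None


def classify(b: Building) -> str:
    # Step 1: use the building's own tag when it is informative.
    own = label_from_values([b.building_tag] if b.building_tag else [])
    if own is not None:
        return own
    # Step 2: fall back to auxiliary data overlapping the footprint.
    aux = label_from_values(b.overlapping)
    if aux is not None:
        return aux
    # Step 3: remaining unknown buildings are classified as residential.
    return "residential"
```

For example, `classify(Building("yes", ["retail"]))` falls through step 1, because "yes" is uninformative, and is labeled from the overlapping feature in step 2. The final default reflects the caption's last step and is consistent with the reported high recall for residential buildings.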