CMAB: A First National-Scale Multi-Attribute Building Dataset in China Derived from Open Source Data and GeoAI

Yecheng Zhang, Huimin Zhao, Ying Long

TL;DR

The paper introduces a geospatial artificial intelligence (GeoAI) framework for large-scale building modeling and presents the first national-scale Multi-Attribute Building dataset (CMAB), covering 3,667 spatial cities, 29 million buildings, and 21.3 billion square meters of rooftops. OCRNet-based rooftop extraction achieves an F1-Score of 89.93%.

Abstract

Rapidly acquiring three-dimensional (3D) building data, including geometric attributes such as rooftop, height, and orientation, as well as indicative attributes such as function, quality, and age, is essential for accurate urban analysis, simulations, and policy updates. Current building datasets cover these attributes incompletely. This paper introduces a geospatial artificial intelligence (GeoAI) framework for large-scale building modeling and presents the first national-scale Multi-Attribute Building dataset (CMAB), covering 3,667 spatial cities, 29 million buildings, and 21.3 billion square meters of rooftops, totaling 337.7 billion cubic meters of building stock; OCRNet-based rooftop extraction achieves an F1-Score of 89.93%. We trained bootstrap aggregated XGBoost models with city administrative classifications, incorporating features such as morphology, location, and function. Using multi-source data, including billions of high-resolution Google Earth images and 60 million street view images (SVIs), we generated rooftop, height, function, age, and quality attributes for each building. Accuracy was validated against model benchmarks, existing similar products, and manual SVI checks, and was mostly above 80%. Our dataset and results can support progress on the global Sustainable Development Goals (SDGs) and urban planning.
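The abstract mentions bootstrap aggregated models. As a rough illustration of what bootstrap aggregation (bagging) means, and not the authors' pipeline, the sketch below trains several base learners on resampled copies of a toy dataset and averages their predictions. Everything in it is an assumption made for illustration: the base learner is a one-split regression stump standing in for XGBoost, the function names are hypothetical, and the footprint-area and height values are invented.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump (a stand-in for XGBoost, for illustration only)."""
    best = None
    # Candidate thresholds are the observed x values except the minimum,
    # so both sides of every split are non-empty.
    for t in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        lo, hi = mean(left), mean(right)
        sse = sum((y - lo) ** 2 for y in left) + sum((y - hi) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lo, hi)
    if best is None:
        # All x values identical in this sample: fall back to a constant prediction.
        const = mean(ys)
        return lambda x: const
    _, t, lo, hi = best
    return lambda x: lo if x < t else hi


def fit_bagged(xs, ys, n_models=25, seed=0):
    """Train n_models stumps on bootstrap resamples and average their predictions."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        # Bootstrap sample: draw n indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: mean([m(x) for m in models])


# Invented toy data: footprint area (m^2) -> building height (m).
areas = [50, 80, 120, 200, 400, 800, 1500, 3000]
heights = [6, 6, 9, 12, 18, 30, 45, 60]

predict = fit_bagged(areas, heights)
print(predict(60), predict(2000))
```

Averaging over resampled learners reduces the variance of any single model, which is the usual motivation for bagging; how the paper combines it with city administrative classifications is not shown here.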

Paper Structure (24 sections, 14 equations, 20 figures, 3 tables)

Figures (20)

  • Figure 1: The overall workflow of this study.
  • Figure 2: Sampling with spatial cities in China and the completeness of multi-source data. (a) Spatial city and administrative city distribution. (b) The ratio of multi-source data within physical cities to the total multi-source data across China. The diagram shows that most of the data in the different datasets are concentrated in spatial cities, which occupy only 1% of China's area.
  • Figure 3: Building attributes and visualized source. "building attribute" records the attributes calculated, and the "data source" is the type of data source visualized for calculation.
  • Figure 4: Construction of building features for height and function estimation. The features marked in bold are categorical; they are label-encoded and stored as the category type so that XGBoost >1.3 automatically recognizes them as categorical features. The features marked in red are new in the function index system compared to the height index system. See Supplementary Table 2 for more details.
  • Figure 5: Evaluating the quality of buildings along the street through SVIs. (a) Building disorder types k for building quality. (b) Temporal and spatial distribution of SVI in the building buffer zone.
  • ...and 15 more figures