Granularity at Scale: Estimating Neighborhood Socioeconomic Indicators from High-Resolution Orthographic Imagery and Hybrid Learning

Ethan Brewer, Giovani Valdrighi, Parikshit Solunke, Joao Rulff, Yurii Piadyk, Zhonghui Lv, Jorge Poco, Claudio Silva

TL;DR

This paper investigates estimating neighborhood-level socioeconomic indicators from high-resolution aerial imagery of 94 US cities using two approaches: a supervised CNN based on ResNet50 and a semi-supervised bag-of-visual-words (BoVW) framework. The supervised model trained on resized neighborhood images achieves strong density estimation, with $R^2=0.81$ and an MAE of around 461 people/km$^2$, while income and education reach roughly $R^2=0.48$–$0.51$. The semi-supervised BoVW approach yields $R^2 \approx 0.61$ for density but only marginal gains for income and education, highlighting both the feasibility of fine-scale density mapping and the limitations for other metrics. The work provides a foundation for fine-grained, remotely sensed socioeconomic monitoring and identifies avenues for generalization, temporal forecasting, and integration with survey data to improve now-casting and policy-relevant insights.

Abstract

Many areas of the world are without basic information on the socioeconomic well-being of the residing population due to limitations in existing data collection methods. Overhead images obtained remotely, such as from satellite or aircraft, can help serve as windows into the state of life on the ground and help "fill in the gaps" where community information is sparse, with estimates at smaller geographic scales requiring higher resolution sensors. Concurrent with improved sensor resolutions, recent advancements in machine learning and computer vision have made it possible to quickly extract features from and detect patterns in image data, in the process correlating these features with other information. In this work, we explore how well two approaches, a supervised convolutional neural network and semi-supervised clustering based on bag-of-visual-words, estimate population density, median household income, and educational attainment of individual neighborhoods from publicly available high-resolution imagery of cities throughout the United States. Results and analyses indicate that features extracted from the imagery can accurately estimate the density ($R^2$ up to 0.81) of neighborhoods, with the supervised approach able to explain about half the variation in a population's income and education. In addition to the presented approaches serving as a basis for further geographic generalization, the novel semi-supervised approach provides a foundation for future work seeking to estimate fine-scale information from aerial imagery without the need for label data.
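
The two headline metrics quoted above, $R^2$ and mean absolute error (MAE), can be illustrated with a minimal sketch. It uses the standard textbook definitions; the function names and the toy density values are hypothetical and are not taken from the paper's code or data.

```python
from typing import Sequence


def r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: variation of the labels around their mean.
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are equal")
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Average absolute difference, in the same units as the labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy values in people per km^2 (made up for illustration only).
true_density = [1000.0, 2000.0, 3000.0, 4000.0]
pred_density = [1200.0, 1800.0, 3300.0, 3900.0]

print(round(r_squared(true_density, pred_density), 3))
print(mean_absolute_error(true_density, pred_density))
```

Read this way, an $R^2$ of about 0.5 means the model's squared error is roughly half of what a constant prediction at the mean would incur, which is the sense in which the supervised approach "explains about half the variation" in income and education.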

Paper Structure (14 sections, 3 equations, 9 figures, 5 tables)

This paper contains 14 sections, 3 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: (A) Illustration of the neighborhoods in states containing cities analyzed in this study. (B) Enlarged view of neighborhoods in the state of Florida. (C) Enlarged view of neighborhoods in Hillsborough County, Florida, which contains the city of Tampa.
  • Figure 2: (A) Illustration of median household income (in 2021 USD) of counties containing the 94 cities examined. City boundaries are in orange. (B) Expanded view of New York City in which its boroughs are coterminous with counties. (C) Expanded view of Chicago in which its city limits are within Cook and DuPage counties (mostly Cook).
  • Figure 3: Processing of a typical neighborhood (this one is in San Jose, CA) for the two processing methods for supervised learning. (A) Patching: The image is split into six 512x512 patches. (B) Resizing: The image is resized to 1353x1350 pixels (the median width and height of a neighborhood).
  • Figure 4: For the semi-supervised approach, examples of 112x112 patches for neighborhoods in different cities. Patch boundaries are denoted with white borders, and census block groups (i.e., neighborhoods) with black borders. In the New York neighborhood, all patches overlapping with the neighborhood are used. For the larger Houston neighborhood, only 50 samples are selected.
  • Figure 5: Visual representation of the ResNet50-based architecture used in the supervised approach. 30% dropout layers are embedded after the first four fully-connected layers.
  • ...and 4 more figures
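
The patching steps described in the captions of Figures 3 and 4 can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' pipeline: `patch_boxes` and `cap_patches` are hypothetical names, the grid is laid over a rectangular bounding box rather than tested against the neighborhood polygon, edge patches are assumed to extend past the image (which would require padding), and the 50 retained patches are assumed to be chosen uniformly at random.

```python
import math
import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def patch_boxes(width: int, height: int, patch: int = 512) -> List[Box]:
    """Tile a width x height image with square patches, row by row.

    Assumption: patches on the right and bottom edges may extend beyond
    the image, so the caller would need to pad the image accordingly.
    """
    if width <= 0 or height <= 0 or patch <= 0:
        raise ValueError("width, height and patch must be positive")
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    return [
        (c * patch, r * patch, (c + 1) * patch, (r + 1) * patch)
        for r in range(rows)
        for c in range(cols)
    ]


def cap_patches(boxes: List[Box], max_samples: int = 50, seed: int = 0) -> List[Box]:
    """Keep all patches for small neighborhoods, else a random subset.

    Assumption: uniform random sampling; the captions only state that
    50 samples are selected for larger neighborhoods.
    """
    if len(boxes) <= max_samples:
        return list(boxes)
    return random.Random(seed).sample(boxes, max_samples)


# Hypothetical 1500x1000 image: a 3x2 grid of 512x512 patches.
supervised = patch_boxes(1500, 1000, patch=512)
print(len(supervised))

# Hypothetical large neighborhood tiled with 112x112 patches, capped at 50.
semi_supervised = cap_patches(patch_boxes(2240, 1120, patch=112))
print(len(semi_supervised))
```

Capping the number of patches keeps very large neighborhoods from dominating the patch pool, at the cost of discarding some imagery for them.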