High-Resolution Building and Road Detection from Sentinel-2

Wojciech Sirko, Emmanuel Asiedu Brempong, Juliana T. C. Marcos, Abigail Annkah, Abel Korme, Mohammed Alewi Hassen, Krishna Sapkota, Tomer Shekel, Abdoulaye Diack, Sella Nevo, Jason Hickey, John Quinn

TL;DR

The paper demonstrates that a teacher–student framework can use freely available Sentinel-2 imagery to predict building and road presence at 50 cm resolution. The student reaches a building mIoU of 79.0%, compared with 85.5% for the high-resolution teacher, and is trained as a multi-task, end-to-end model on a large, globally distributed dataset. The approach uses a 32-frame Sentinel-2 stack, an HRNet-based encoder with cross-time fusion, and a decoder that upscales to the target resolution; it also supports building centroid counting and height prediction. Key findings include strong cross-region generalization, the utility of incidence-angle metadata for label alignment, and clear advantages from temporal fusion and from higher label and input resolutions. By relying on openly available data, the method broadens access to fine-grained mapping for large-scale urban analytics, disaster response, and policy planning where high-resolution imagery is unavailable or costly.
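As a rough illustration of the teacher–student idea, the sketch below works out the scale gap between the 10 m input and the 50 cm output and computes one plausible distillation objective. The per-pixel binary cross-entropy against the teacher's soft predictions is an assumption made for illustration; the summary above does not state the loss actually used, and the function names are hypothetical.

```python
import math

S2_RES_M = 10.0      # Sentinel-2 input resolution in meters (from the text)
TARGET_RES_M = 0.5   # output resolution in meters (from the text)


def upscale_factor(input_res_m=S2_RES_M, target_res_m=TARGET_RES_M):
    """Linear upscaling the decoder must provide between input and output."""
    return round(input_res_m / target_res_m)


def tile_pixels(tile_side_m, res_m):
    """Side length of a square tile, in pixels, at a given resolution."""
    return tile_side_m / res_m


def soft_bce(student_probs, teacher_probs, eps=1e-7):
    """Mean per-pixel binary cross-entropy of student vs. teacher outputs.

    ASSUMPTION: the teacher's per-pixel confidences are used as soft labels.
    This is a common distillation choice, not a confirmed detail of the paper.
    """
    total = 0.0
    for s, t in zip(student_probs, teacher_probs):
        s = min(max(s, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(s) + (1.0 - t) * math.log(1.0 - s)
    return total / len(student_probs)


if __name__ == "__main__":
    print("upscale factor:", upscale_factor())
    print("192 m tile at 50 cm:", tile_pixels(192, TARGET_RES_M), "px per side")
    print("192 m tile at 10 m:", tile_pixels(192, S2_RES_M), "px per side")
    teacher = [0.9, 0.1, 0.8]  # toy teacher confidences, not real data
    print("loss, agreeing student:", soft_bce([0.9, 0.1, 0.8], teacher))
    print("loss, disagreeing student:", soft_bce([0.1, 0.9, 0.2], teacher))
```

The 192 m tile side matches the examples in Figure 2. Each Sentinel-2 pixel covers a 20 x 20 block of output pixels, which is why a single frame cannot resolve individual buildings and the model relies on a stack of frames.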

Abstract

Mapping buildings and roads automatically with remote sensing typically requires high-resolution imagery, which is expensive to obtain and often sparsely available. In this work we demonstrate how multiple 10 m resolution Sentinel-2 images can be used to generate 50 cm resolution building and road segmentation masks. This is done by training a 'student' model with access to Sentinel-2 images to reproduce the predictions of a 'teacher' model which has access to corresponding high-resolution imagery. While the predictions do not have all the fine detail of the teacher model, we find that we are able to retain much of the performance: for building segmentation we achieve 79.0% mIoU, compared to the high-resolution teacher model accuracy of 85.5% mIoU. We also describe two related methods that work on Sentinel-2 imagery: one for counting individual buildings which achieves $R^2 = 0.91$ against true counts and one for predicting building height with 1.5 meter mean absolute error. This work opens up new possibilities for using freely available Sentinel-2 imagery for a range of tasks that previously could only be done with high-resolution satellite imagery.
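The three headline numbers in the abstract are standard metrics, and a minimal sketch of how each is typically computed may help when reading them. The definitions below are assumptions: mIoU is taken as the mean of the building and background IoU, and $R^2$ as the coefficient of determination. The paper may define either differently (for example, $R^2$ as a squared correlation). All inputs are toy values, not results from the paper.

```python
def iou(pred, target, positive):
    """Intersection over union for one class over flattened pixel labels."""
    inter = sum(1 for p, t in zip(pred, target) if p == positive and t == positive)
    union = sum(1 for p, t in zip(pred, target) if p == positive or t == positive)
    return inter / union if union else 1.0  # class absent from both: perfect


def mean_iou(pred, target, classes=(0, 1)):
    """ASSUMPTION: mIoU is the unweighted mean IoU over the listed classes."""
    return sum(iou(pred, target, c) for c in classes) / len(classes)


def r_squared(y_true, y_pred):
    """Coefficient of determination, e.g. predicted vs. true building counts."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def mean_absolute_error(y_true, y_pred):
    """Mean absolute error, e.g. building height in meters."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


if __name__ == "__main__":
    # Toy 4-pixel masks: 1 = building, 0 = background.
    print("mIoU:", mean_iou([1, 1, 0, 0], [1, 0, 0, 0]))
    # Toy per-tile building counts.
    print("R^2:", r_squared([10, 20, 30], [12, 19, 29]))
    # Toy building heights in meters.
    print("MAE:", mean_absolute_error([3.0, 6.0], [4.5, 4.5]))
```

Under these definitions, the gap between 79.0% and 85.5% mIoU measures how much per-pixel overlap is lost when the model sees only Sentinel-2 input, while $R^2$ and mean absolute error score the counting and height outputs as regression tasks.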

Paper Structure

This paper contains 35 sections, 2 equations, 26 figures, 11 tables.

Figures (26)

  • Figure 1: Example operation of our model, where multiple frames of low-resolution Sentinel-2 imagery are used to make a single frame of high-resolution predictions for a variety of output types. A high-resolution image of the same scene is shown for comparison.
  • Figure 2: Examples of building and road detection from Sentinel-2 imagery, each covering an area of $192^2$ m$^2$. The panels on the left show high-resolution satellite imagery of the scene for comparison; although Sentinel-2 imagery has a much lower level of detail in each frame, we are able to predict fine-scale features of buildings and roads.
  • Figure 3: Estimation of the number of buildings in a tile, based on predicting building centroids (left: high-resolution image for comparison; centre: Sentinel-2 RGB; right: predicted centroid mask). This method can obtain $R^2 = 0.91$ with respect to true counts even though individual buildings cannot be discerned in the source imagery.
  • Figure 4: Above-ground object height prediction (left: high-resolution image for comparison; centre: Sentinel-2 RGB; right: predicted height mask). This method can predict building height with 1.5-meter mean absolute error.
  • Figure 5: Teacher-student setup in this work. The student model is trained to reproduce the same outputs as a high-resolution model using 50 cm resolution imagery, but using only a stack of Sentinel-2 images at 10 m resolution.
  • ...and 21 more figures