GeoFormer: A Swin Transformer-Based Framework for Scene-Level Building Height and Footprint Estimation from Sentinel Imagery

Han Jinzhen; JinByeong Lee; JiSung Kim; MinKyung Cho; DaHee Kim; HongSik Yun

TL;DR

GeoFormer is a Swin Transformer-based framework that jointly estimates building height (BH) and building footprint (BF) on a 100 m grid from Sentinel-1/2 imagery and open DEM data, targeting global scalability and cross-city generalisation. It introduces GeoSplit, a geo-blocked data partitioning scheme, and generalises well across 54 cities and in cross-continent transfer, reaching a BH RMSE of about 3.19 m and a BF RMSE of about 0.050. Ablation studies show that DEM is essential for height estimation, that optical data dominate over SAR, and that multi-source fusion yields the best accuracy, whereas increasing model capacity can cause overfitting. Code, weights, and global products are openly released, supporting scalable, accessible urban 3D mapping and post-disaster assessment at continental to global scale.

Abstract

Accurate three-dimensional urban data are critical for climate modelling, disaster risk assessment, and urban planning, yet they remain scarce because existing approaches rely on proprietary sensors or generalise poorly across cities. We propose GeoFormer, an open-source Swin Transformer framework that jointly estimates building height (BH) and building footprint (BF) on a 100 m grid using only Sentinel-1/2 imagery and open DEM data. A geo-blocked splitting strategy ensures strict spatial independence between training and test sets. Evaluated over 54 diverse cities, GeoFormer achieves a BH RMSE of 3.19 m and a BF RMSE of 0.05, improvements of 7.5% and 15.3%, respectively, over the strongest CNN baseline, while keeping BH RMSE under 3.5 m in cross-continent transfer. Ablation studies confirm that DEM is indispensable for height estimation and that optical reflectance dominates over SAR, though multi-source fusion yields the best overall accuracy. All code, weights, and global products are publicly released.
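The geo-blocked splitting idea mentioned in the abstract can be illustrated with a generic spatial-blocking sketch. This is not the authors' GeoSplit implementation, whose details are not given here: the function names, the 0.5 degree block size, the 20% test fraction, and the hash-based assignment are all assumptions chosen for illustration. The point it shows is that every sample inherits its split from the block it falls in, so no block contributes to both training and test sets.

```python
import hashlib
import math


def block_id(lon, lat, block_deg=0.5):
    """Return the (column, row) index of the lon/lat block containing a sample.

    block_deg is a hypothetical block size, not a value from the paper.
    """
    return (math.floor(lon / block_deg), math.floor(lat / block_deg))


def assign_split(lon, lat, block_deg=0.5, test_fraction=0.2, seed=0):
    """Assign a sample to 'train' or 'test' based only on its block.

    The block index is hashed to a pseudo-random number in [0, 1), so the
    assignment is deterministic and identical for all samples in one block.
    """
    bx, by = block_id(lon, lat, block_deg)
    digest = hashlib.sha256(f"{seed}:{bx}:{by}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64
    return "test" if u < test_fraction else "train"


def split_samples(samples, **kwargs):
    """Group (lon, lat) samples into train and test lists by block."""
    out = {"train": [], "test": []}
    for lon, lat in samples:
        out[assign_split(lon, lat, **kwargs)].append((lon, lat))
    return out
```

A sketch like this only guarantees that blocks are not shared between splits. Samples near a block edge can still see neighbouring pixels from the other split through the model's receptive field, so a practical scheme would likely also need a buffer between train and test blocks.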

Paper Structure (25 sections, 5 equations, 24 figures, 9 tables)

This paper contains 25 sections, 5 equations, 24 figures, 9 tables.

Introduction
Data Preprocessing
Reference Data
Explanatory Data and Processing
Methodology
Model Architecture
Loss Function
Optimization and Training Settings
Model Evaluation
Evaluation Metrics and Justification
CNN Baseline Comparison
Experimental Results
Error Source Analysis
Ablation Study
Structural Ablation
...and 10 more sections

Figures (24)

Figure 1: Workflow of the proposed GeoFormer framework.
Figure 2: Illustration of Fishnet Analysis: a 100 m grid overlays vector building footprints to compute per-cell height and footprint coverage.
Figure 3: Geographic distribution of SHAFTS (v2022.3) reference cities.
Figure 4: Structure of a single city group in the final HDF5 file.
Figure 5: Data leakage from random sampling under dynamic receptive field concatenation.
...and 19 more figures
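
The Fishnet Analysis described in the Figure 2 caption (a 100 m grid overlaid on vector building footprints to compute per-cell height and footprint coverage) can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's pipeline: footprints are treated as non-overlapping axis-aligned rectangles in projected metre coordinates, and the per-cell height is taken as an area-weighted mean, which is an assumed aggregation rule. The function name `fishnet` and the input tuple layout are hypothetical.

```python
import math
from collections import defaultdict

CELL = 100.0  # grid cell size in metres, matching the 100 m grid


def fishnet(buildings, cell=CELL):
    """Aggregate rectangular footprints onto a regular grid.

    buildings: iterable of (xmin, ymin, xmax, ymax, height_m) in projected
    metres (assumed non-overlapping rectangles, a simplification).
    Returns {(col, row): (footprint_fraction, area_weighted_mean_height)}.
    """
    area = defaultdict(float)    # built-up area per cell, in m^2
    volume = defaultdict(float)  # sum of area * height per cell
    for xmin, ymin, xmax, ymax, height in buildings:
        for col in range(math.floor(xmin / cell), math.ceil(xmax / cell)):
            for row in range(math.floor(ymin / cell), math.ceil(ymax / cell)):
                # Clip the rectangle to the current cell.
                w = min(xmax, (col + 1) * cell) - max(xmin, col * cell)
                d = min(ymax, (row + 1) * cell) - max(ymin, row * cell)
                if w > 0 and d > 0:
                    area[(col, row)] += w * d
                    volume[(col, row)] += w * d * height
    return {k: (area[k] / cell**2, volume[k] / area[k]) for k in area}


# Two toy buildings; the second straddles the boundary between two cells.
cells = fishnet([(10, 10, 60, 30, 12.0), (80, 40, 120, 60, 30.0)])
```

Real footprints are arbitrary polygons, so a production version would clip them with a geometry library rather than rectangle arithmetic; the per-cell bookkeeping would stay the same.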
