HiRes-FusedMIM: A High-Resolution RGB-DSM Pre-trained Model for Building-Level Remote Sensing Applications

Guneet Mutreja, Philipp Schuegraf, Ksenia Bittner

TL;DR

HiRes-FusedMIM tackles the problem of underutilized high-resolution DSM information in building-level remote sensing by introducing a dual-encoder SimMIM framework that learns joint RGB-DSM representations. It relies on a large-scale, high-resolution dataset (over 368k RGB-DSM pairs at 0.2–0.5 m) and a multi-objective loss that combines reconstruction ($\mathcal{L}_{\text{MIM}}$) and contrastive ($\mathcal{L}_{\text{InfoNCE}}$) terms, with $\mathcal{L}_{\text{total}} = (1 - \alpha) \mathcal{L}_{\text{MIM}} + \alpha \mathcal{L}_{\text{InfoNCE}}$ and $\alpha = 0.05$. The model uses separate RGB and DSM encoders (both Swin transformers) with feature concatenation and two decoders to reconstruct masked patches, enabling specialized modality representations and effective cross-modal fusion. Empirical results across classification, semantic segmentation, and instance segmentation show that DSM-informed pre-training yields consistent gains over RGB-only baselines and higher-resolution baselines on datasets like WHU Aerial, LoveDA, Vaihingen, GeoNRW, and UBCv2, with notable improvements on Vaihingen (RGB+DSM $\text{mIoU}=74.4\%$) and GeoNRW (RGB+DSM $\text{mIoU}=61.68\%$), and a strong instance segmentation improvement on UBCv2 ($\text{AP}=17.7\%$). The work demonstrates practical impact for building-level analysis and digital twins and provides pretrained weights to catalyze further multi-modal remote sensing research.
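The weighting in $\mathcal{L}_{\text{total}}$ can be illustrated with a small, dependency-free sketch. Only the combination formula and $\alpha = 0.05$ come from the summary above. The rest is assumed for illustration and may differ from the authors' implementation: the helper names (`masked_l1`, `info_nce`, `total_loss`), the L1 reconstruction term over masked positions, the one-directional RGB-to-DSM contrastive term, and the temperature of 0.07.

```python
import math


def masked_l1(pred, target, mask):
    """Mean absolute error over masked positions only (mask entry 1 = masked).

    Assumption: an L1 reconstruction term restricted to masked positions.
    """
    total = sum(abs(p - t) for p, t, m in zip(pred, target, mask) if m)
    count = sum(1 for m in mask if m)
    return total / count if count else 0.0


def info_nce(rgb_embs, dsm_embs, temperature=0.07):
    """InfoNCE over a batch: the i-th RGB embedding should match the i-th DSM embedding.

    Assumptions: cosine similarity, RGB-to-DSM direction only, temperature 0.07.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    rgb = [normalize(v) for v in rgb_embs]
    dsm = [normalize(v) for v in dsm_embs]
    loss = 0.0
    for i, r in enumerate(rgb):
        logits = [sum(a * b for a, b in zip(r, d)) / temperature for d in dsm]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denom = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denom - logits[i]
    return loss / len(rgb)


def total_loss(l_mim, l_info_nce, alpha=0.05):
    """L_total = (1 - alpha) * L_MIM + alpha * L_InfoNCE, as stated in the summary."""
    return (1.0 - alpha) * l_mim + alpha * l_info_nce


if __name__ == "__main__":
    # Toy values: one flattened patch and a batch of two embedding pairs.
    l_mim = masked_l1([1.0, 2.0, 3.0], [1.0, 0.0, 0.0], [0, 1, 1])
    l_nce = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    print(total_loss(l_mim, l_nce))
```

With $\alpha = 0.05$ the reconstruction term carries 95% of the weight, so in this sketch the contrastive term acts as a light alignment signal between the two modalities rather than the dominant objective.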

Abstract

Recent advances in self-supervised learning have led to the development of foundation models that have significantly advanced performance in various computer vision tasks. However, despite their potential, these models often overlook the crucial role of high-resolution digital surface models (DSMs) in understanding urban environments, particularly for building-level analysis, which is essential for applications like digital twins. To address this gap, we introduce HiRes-FusedMIM, a novel pre-trained model specifically designed to leverage the rich information contained within high-resolution RGB and DSM data. HiRes-FusedMIM utilizes a dual-encoder simple masked image modeling (SimMIM) architecture with a multi-objective loss function that combines reconstruction and contrastive objectives, enabling it to learn powerful, joint representations from both modalities. We conducted a comprehensive evaluation of HiRes-FusedMIM on a diverse set of downstream tasks, including classification, semantic segmentation, and instance segmentation. Our results demonstrate that: 1) HiRes-FusedMIM outperforms previous state-of-the-art geospatial methods on several building-related datasets, including WHU Aerial and LoveDA, demonstrating its effectiveness in capturing and leveraging fine-grained building information; 2) Incorporating DSMs during pre-training consistently improves performance compared to using RGB data alone, highlighting the value of elevation information for building-level analysis; 3) The dual-encoder architecture of HiRes-FusedMIM, with separate encoders for RGB and DSM data, significantly outperforms a single-encoder model on the Vaihingen segmentation task, indicating the benefits of learning specialized representations for each modality. To facilitate further research and applications in this direction, we will publicly release the trained model weights.

Paper Structure

This paper contains 25 sections, 4 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: HiRes-Fused pre-training and fine-tuning.
  • Figure 2: Visualization of the reconstruction capabilities of the pretrained model. The left two columns show ground truth and reconstructions for RGB samples, while the right two columns show ground truth and reconstructions for DSM samples.
  • Figure 3: HiRes-FusedMIM Demonstrates Accurate Segmentation of Buildings and Other Urban Features: Visualized Results on the WHU Aerial, LoveDA, Vaihingen, GeoNRW, and SpaceNetv1 Datasets.
  • Figure 4: HiRes-FusedMIM Effectively Segments Individual Buildings in Diverse Urban Environments: Visualized Examples from the UBCv2 Test Set.