Deep Multimodal Fusion for Semantic Segmentation of Remote Sensing Earth Observation Data

Ivica Dimitrovski, Vlatko Spasev, Ivan Kitanovski

TL;DR

The paper tackles semantic segmentation for Earth observation by combining two complementary modalities: Very High Resolution (VHR) aerial imagery and Satellite Image Time Series (SITS). It introduces a Late Fusion Deep Learning Model (LF-DLM) with two branches, UNetFormer with a MaxViT encoder for spatial detail and U-TAE for temporal dynamics, whose predictions are fused via a weighted geometric mean. On the FLAIR dataset, LF-DLM achieves state-of-the-art performance (mIoU around 63%, with one variant reaching 64.52%), outperforming single-modality baselines and prior state-of-the-art methods while respecting strict inference-time limits. The work demonstrates the value of multi-modality fusion for robust, scalable land-cover segmentation in remote sensing, and points to class-balanced fusion strategies as a direction for future improvement.
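
To make the fusion step concrete, here is a minimal NumPy sketch of a weighted geometric mean applied to two per-pixel class-probability maps. It is an illustration under stated assumptions, not the authors' implementation: the function name `fuse_geometric`, the single scalar weight `w_aerial`, the renormalisation step, and the toy array sizes are all hypothetical, and both branches are assumed to already output probabilities on the same pixel grid.

```python
import numpy as np


def fuse_geometric(p_aerial, p_sits, w_aerial=0.5, eps=1e-8):
    """Weighted geometric mean of two per-pixel class-probability maps.

    p_aerial, p_sits: arrays of shape (num_classes, H, W) whose values sum
    to 1 over the class axis. w_aerial in [0, 1] weights the aerial branch;
    the time-series branch receives 1 - w_aerial. (Hypothetical interface.)
    """
    if p_aerial.shape != p_sits.shape:
        raise ValueError("both probability maps must share one shape")
    if not 0.0 <= w_aerial <= 1.0:
        raise ValueError("w_aerial must lie in [0, 1]")
    # Log space: p_a**w * p_s**(1 - w) == exp(w*log(p_a) + (1 - w)*log(p_s)).
    log_fused = (w_aerial * np.log(p_aerial + eps)
                 + (1.0 - w_aerial) * np.log(p_sits + eps))
    fused = np.exp(log_fused)
    # Renormalise so the class scores at each pixel sum to 1 again.
    fused /= fused.sum(axis=0, keepdims=True)
    return fused


def softmax(logits, axis=0):
    """Numerically stable softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


# Toy stand-ins for the two branch outputs (random, not real model outputs).
rng = np.random.default_rng(0)
num_classes, height, width = 4, 8, 8  # illustrative sizes, not FLAIR's
p_aerial = softmax(rng.normal(size=(num_classes, height, width)))
p_sits = softmax(rng.normal(size=(num_classes, height, width)))

fused = fuse_geometric(p_aerial, p_sits, w_aerial=0.7)
labels = fused.argmax(axis=0)  # (H, W) map of predicted class indices
```

Unlike an arithmetic mean, a geometric mean drives a class score towards zero whenever either branch assigns it a very low probability, so a class tends to win only where both branches give it reasonable support. The weight would normally be tuned on a validation set.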

Abstract

Accurate semantic segmentation of remote sensing imagery is critical for various Earth observation applications, such as land cover mapping, urban planning, and environmental monitoring. However, individual data sources often present limitations for this task. Very High Resolution (VHR) aerial imagery provides rich spatial details but cannot capture temporal information about land cover changes. Conversely, Satellite Image Time Series (SITS) capture temporal dynamics, such as seasonal variations in vegetation, but with limited spatial resolution, making it difficult to distinguish fine-scale objects. This paper proposes a late fusion deep learning model (LF-DLM) for semantic segmentation that leverages the complementary strengths of both VHR aerial imagery and SITS. The proposed model consists of two independent deep learning branches. One branch integrates detailed textures from aerial imagery captured by UNetFormer with a Multi-Axis Vision Transformer (MaxViT) backbone. The other branch captures complex spatio-temporal dynamics from the Sentinel-2 satellite image time series using a U-Net with Temporal Attention Encoder (U-TAE). This approach leads to state-of-the-art results on the FLAIR dataset, a large-scale benchmark for land cover segmentation using multi-source optical imagery. The findings highlight the importance of multi-modality fusion in improving the accuracy and robustness of semantic segmentation in remote sensing applications.

Paper Structure (6 sections, 4 figures, 2 tables)

This paper contains 6 sections, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Example patches from the FLAIR dataset. Each patch contains an aerial image with red, green, blue (RGB), and near-infrared (NIR) values; a pixel-precise digital surface model providing an elevation for each pixel; segmentation map with labels for each pixel; and an optical time series from several months, centered on the aerial image. The red frame marks the area that corresponds to the aerial image.
  • Figure 2: The distribution of pixels within the labels across the train, validation, and test sets of the FLAIR dataset.
  • Figure 3: Confusion matrix for LF-DLM on the FLAIR dataset.
  • Figure 4: Example images, ground-truth masks, and inference masks from the FLAIR dataset. The first row shows example images. The second row shows the corresponding ground-truth masks. The third row shows the prediction results of the LF-DLM.