SpecSAR-Former: A Lightweight Transformer-based Network for Global LULC Mapping Using Integrated Sentinel-1 and Sentinel-2

Hao Yu, Gen Li, Haoyu Liu, Songyan Zhu, Wenquan Dong, Changjian Li

TL;DR

This work extends a global LULC benchmark by introducing Dynamic World+, a synchronized Sentinel-1/Sentinel-2 dataset, and presents SpecSAR-Former, a lightweight two-branch transformer that fuses spectral and SAR information through a Dual Modal Enhancement Module (DMEM) and a Mutual Modal Aggregation Module (MMAM). The model employs overlapped patch embeddings, a hierarchical modal interaction encoder, and an all-MLP decoder. It allocates parameters unevenly between the two modalities and is trained with a Focal Loss objective to handle class imbalance. Reported results are state of the art among the compared models (mIoU = 59.58%, OA = 79.48%, F1 = 71.68%) with only 26.70M parameters and 109.59G FLOPs, at a near-real-time inference speed of 38.23 FPS. The work argues that bidirectional cross-modal attention and efficient fusion enable accurate, scalable global LULC segmentation with a compact model, with practical implications for geospatial monitoring and policy support.
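To make the role of the Focal Loss concrete, here is a minimal per-pixel sketch in plain Python. It follows the standard focal loss form, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t). The function name `focal_loss` and the values `gamma=2.0` and `alpha=1.0` are illustrative assumptions, not settings taken from the paper.

```python
import math


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for a single pixel.

    probs  : per-class probabilities (assumed already softmax-normalised)
    target : index of the true class
    gamma  : focusing parameter; gamma = 0 recovers plain cross-entropy
    alpha  : scalar weight (assumed value, not from the paper)
    """
    p_t = probs[target]
    # (1 - p_t)^gamma shrinks the loss of pixels the model already gets right,
    # so rare or hard classes contribute relatively more to the total loss.
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


# A confidently correct pixel versus a poorly classified one.
easy = focal_loss([0.9, 0.05, 0.05], target=0)
hard = focal_loss([0.2, 0.5, 0.3], target=0)
# With gamma = 0 the same pixel yields ordinary cross-entropy.
ce_easy = focal_loss([0.9, 0.05, 0.05], target=0, gamma=0.0)
```

The easy pixel is down-weighted strongly relative to its cross-entropy value, while the hard pixel keeps most of its loss. This is the mechanism that helps with imbalanced class distributions.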

Abstract

Recent approaches in remote sensing have increasingly focused on multimodal data, driven by the growing availability of diverse earth observation datasets. Integrating complementary information from different modalities has shown substantial potential in enhancing semantic understanding. However, existing global multimodal datasets often omit Synthetic Aperture Radar (SAR) data, which excels at capturing texture and structural details. SAR, as a complementary perspective to other modalities, facilitates the use of spatial information for global land use and land cover (LULC) mapping. To address this gap, we introduce the Dynamic World+ dataset, expanding the current authoritative multispectral dataset, Dynamic World, with aligned SAR data. Additionally, to facilitate the combination of multispectral and SAR data, we propose a lightweight transformer architecture termed SpecSAR-Former. It incorporates two innovative modules, the Dual Modal Enhancement Module (DMEM) and the Mutual Modal Aggregation Module (MMAM), designed to exploit cross-information between the two modalities in a split-fusion manner. These modules enhance the model's ability to integrate spectral and spatial information, thereby improving the overall performance of global LULC semantic segmentation. Furthermore, we adopt an imbalanced parameter allocation strategy that assigns parameters to different modalities based on their importance and information density. Extensive experiments demonstrate that our network outperforms existing transformer- and CNN-based models, achieving a mean Intersection over Union (mIoU) of 59.58%, an Overall Accuracy (OA) of 79.48%, and an F1 Score of 71.68% with only 26.70M parameters. The code will be available at https://github.com/Reagan1311/LULC_segmentation.
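As an illustration of the three reported metrics, the sketch below derives OA, mIoU, and F1 from a confusion matrix in plain Python. The function name `segmentation_metrics`, the toy two-class matrix, and the choice of macro averaging for F1 are assumptions made for this example. The averaging convention the authors used is not stated in this summary.

```python
def segmentation_metrics(conf):
    """Compute OA, mIoU and macro F1 from a square confusion matrix.

    conf[i][j] = number of pixels whose true class is i and predicted class is j.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(n))

    ious, f1s = [], []
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp                        # true i, predicted otherwise
        fp = sum(conf[r][i] for r in range(n)) - tp   # predicted i, true otherwise
        union = tp + fp + fn
        if union == 0:
            continue  # class absent from both labels and predictions
        ious.append(tp / union)
        f1s.append(2 * tp / (2 * tp + fp + fn))

    return {
        "OA": correct / total,
        "mIoU": sum(ious) / len(ious),
        "F1": sum(f1s) / len(f1s),
    }


# Toy two-class example (made-up counts, not data from the paper).
metrics = segmentation_metrics([[8, 2], [1, 9]])
```

Because IoU penalises both false positives and false negatives in its denominator, mIoU is typically the lowest of the three numbers, which matches the ordering of the reported scores.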

Paper Structure (33 sections, 10 equations, 7 figures, 6 tables)

Figures (7)

  • Figure 1: Comparison of various fusion techniques. (a) Input fusion integrates inputs using modality-specific operations. (b) Feature fusion employs an attention-based module to merge features in a unidirectional manner. (c) Our interactive fusion method introduces bidirectional cross-modal feature rectification.
  • Figure 2: Segmentation results from the proposed SpecSAR-Former and other foundational models on the Google Dynamic World+ dataset, illustrated with bubbles where each bubble's size corresponds to the computational complexity (FLOPs) of the model.
  • Figure 3: Class distribution across 14 global biomes in the dataset.
  • Figure 4: Schematic illustration of the proposed SpecSAR-Former.
  • Figure 5: Illustration of the SAR-to-Spectral cross attention module.
  • ...and 2 more figures
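The bidirectional cross-modal interaction named in the captions for Figures 1 and 5 can be illustrated with a generic single-head scaled dot-product cross attention. This is only a sketch of the general technique, not the authors' module. The helper names, the toy token values, the absence of learned projection matrices, and the choice of which modality supplies the queries are all assumptions.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross attention.

    queries      : tokens of the modality being refined (list of d-dim lists)
    keys, values : tokens of the other modality
    Returns one attended vector per query token.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


# Toy token sets (hypothetical values); identity projections are assumed.
spectral = [[1.0, 0.0], [0.0, 1.0]]
sar = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Bidirectional use: each modality queries the other.
spec_from_sar = cross_attention(spectral, sar, sar)
sar_from_spec = cross_attention(sar, spectral, spectral)
```

Running attention in both directions is what distinguishes an interactive scheme from unidirectional feature fusion: each branch receives context from the other rather than only one branch being updated.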