Multi-encoder ConvNeXt Network with Smooth Attentional Feature Fusion for Multispectral Semantic Segmentation

Leo Thomas Ramos; Angel D. Sappa

TL;DR

MeCSAFNet is a dual-branch encoder-decoder framework for multispectral semantic segmentation. It processes visible and non-visible spectral bands in separate ConvNeXt encoders and integrates spatial and spectral information through a pyramid fusion pathway augmented by CBAM attention and the ASAU activation. The architecture supports 4-channel RGB-NIR and 6-channel RGB-NIR+NDVI+NDWI inputs for land-cover classification, and lightweight variants keep it efficient. On Five-Billion-Pixels (FBP) and ISPRS Potsdam, it outperforms U-Net, DeepLabV3+, and SegFormer, with gains of up to +19.62% in mIoU on FBP and up to +9.11% on Potsdam, while compact variants offer favorable training and inference costs. The results support explicit spectral modality separation and attention-guided fusion as a basis for accurate and scalable multispectral segmentation in remote sensing, with potential extensions to broader spectral domains and alternative backbones.

Abstract

This work proposes MeCSAFNet, a multi-branch encoder-decoder architecture for land cover segmentation in multispectral imagery. The model separately processes visible and non-visible channels through dual ConvNeXt encoders, followed by individual decoders that reconstruct spatial information. A dedicated fusion decoder integrates intermediate features at multiple scales, combining fine spatial cues with high-level spectral representations. The feature fusion is further enhanced with CBAM attention, and the ASAU activation function contributes to stable and efficient optimization. The model is designed to process different spectral configurations, including a 4-channel (4c) input combining RGB and NIR bands, as well as a 6-channel (6c) input incorporating NDVI and NDWI indices. Experiments on the Five-Billion-Pixels (FBP) and Potsdam datasets demonstrate significant performance gains. On FBP, MeCSAFNet-base (6c) surpasses U-Net (4c) by +19.21%, U-Net (6c) by +14.72%, SegFormer (4c) by +19.62%, and SegFormer (6c) by +14.74% in mIoU. On Potsdam, MeCSAFNet-large (4c) improves over DeepLabV3+ (4c) by +6.48%, DeepLabV3+ (6c) by +5.85%, SegFormer (4c) by +9.11%, and SegFormer (6c) by +4.80% in mIoU. The model also achieves consistent gains over several recent state-of-the-art approaches. Moreover, compact variants of MeCSAFNet deliver notable performance with lower training time and reduced inference cost, supporting their deployment in resource-constrained environments.
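To make the 4c and 6c input configurations concrete, the following is a minimal sketch of how a 6-channel array could be assembled from an RGB-NIR tile. It is not the authors' preprocessing code. The channel order (R, G, B, NIR), the function name `build_six_channel`, the `eps` stabilizer, and the use of the common green/NIR definition of NDWI are all assumptions made for illustration; the abstract does not specify them.

```python
import numpy as np


def build_six_channel(rgbn: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Append NDVI and NDWI to an (H, W, 4) array ordered R, G, B, NIR.

    Hypothetical helper: channel order and eps are assumptions,
    not details taken from the paper.
    """
    rgbn = rgbn.astype(np.float32)
    red, green, nir = rgbn[..., 0], rgbn[..., 1], rgbn[..., 3]

    # NDVI = (NIR - R) / (NIR + R); eps avoids division by zero.
    ndvi = (nir - red) / (nir + red + eps)
    # NDWI (green/NIR form) = (G - NIR) / (G + NIR).
    ndwi = (green - nir) / (green + nir + eps)

    # Result has shape (H, W, 6): R, G, B, NIR, NDVI, NDWI.
    return np.concatenate([rgbn, ndvi[..., None], ndwi[..., None]], axis=-1)


if __name__ == "__main__":
    tile = np.random.rand(8, 8, 4).astype(np.float32)
    print(build_six_channel(tile).shape)
```

Both indices are bounded to roughly [-1, 1] for non-negative reflectances, so they can be stacked alongside normalized bands without further rescaling in this sketch.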

Paper Structure (21 sections, 6 equations, 11 figures, 11 tables, 2 algorithms)

Introduction
Related Work
Methods
Dataset description
Five-Billion-Pixels
ISPRS Potsdam
Model design
ConvNeXt encoder
Decoding and feature fusion
Performance measurement
Overall accuracy
Intersection over union
Mean intersection over union
Mean F1-score
Training and implementation details
...and 6 more sections
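
The four measures listed under Performance measurement can all be derived from a single confusion matrix. The sketch below uses the standard definitions (OA as the fraction of correctly labeled pixels, per-class IoU as TP / (TP + FP + FN), per-class F1 as 2TP / (2TP + FP + FN), with unweighted class means). The function names and the choice to skip classes that never occur are assumptions; the paper's exact averaging conventions are not given in this excerpt.

```python
import numpy as np


def confusion_matrix(y_true: np.ndarray, y_pred: np.ndarray, num_classes: int) -> np.ndarray:
    """Rows index the true class, columns the predicted class."""
    idx = num_classes * y_true.ravel().astype(np.int64) + y_pred.ravel().astype(np.int64)
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def segmentation_metrics(cm: np.ndarray) -> dict:
    """Overall accuracy, per-class IoU, mIoU, and mean F1 from a confusion matrix."""
    cm = cm.astype(np.float64)
    tp = np.diag(cm)
    fp = cm.sum(axis=0) - tp  # predicted as class c but labeled otherwise
    fn = cm.sum(axis=1) - tp  # labeled class c but predicted otherwise

    # Classes absent from both labels and predictions yield NaN and are
    # excluded from the means (an assumption of this sketch).
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / (tp + fp + fn)
        f1 = 2 * tp / (2 * tp + fp + fn)

    return {
        "oa": tp.sum() / cm.sum(),
        "iou": iou,
        "miou": np.nanmean(iou),
        "mf1": np.nanmean(f1),
    }


if __name__ == "__main__":
    y_true = np.array([0, 0, 1, 1])
    y_pred = np.array([0, 1, 1, 1])
    print(segmentation_metrics(confusion_matrix(y_true, y_pred, num_classes=2)))
```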

Figures (11)

Figure 1: Example images of the Five-Billion-Pixels dataset.
Figure 2: Example images of the Potsdam dataset.
Figure 3: Overview of the architecture used in this work.
Figure 4: Comparison between ResNet, Swin Transformer, and ConvNeXt blocks.
Figure 5: ConvNeXt architecture structure (base version). Stages are connected sequentially, where the output of each stage serves as the input to the subsequent stage through downsampling operations.
...and 6 more figures
