Fus-MAE: A cross-attention-based data fusion approach for Masked Autoencoders in remote sensing
Hugo Chan-To-Hing, Bharadwaj Veeravalli
TL;DR
Fus-MAE addresses SAR–optical data fusion under limited labeled data with a masked autoencoder that uses cross-attention. The approach introduces a cross-attention-based patch projection in the encoder for early fusion and a cross-attention decoder for feature-level fusion, and explores independent and consistent masking strategies. Empirical results on BigEarthNet-MM and SEN12MS show Fus-MAE competing with or surpassing state-of-the-art contrastive and MIM-based methods, with especially strong gains when labels are scarce. This work demonstrates that cross-modal interactions can be effectively captured through cross-attention in MAE pretraining, reducing reliance on carefully engineered data augmentations and enabling robust remote sensing fusion across modalities.
Abstract
Self-supervised frameworks for representation learning have recently stirred up interest among the remote sensing community, given their potential to mitigate the high labeling costs associated with curating large satellite image datasets. In the realm of multimodal data fusion, while the often-used contrastive learning methods can help bridge the domain gap between different sensor types, they rely on data augmentation techniques that require expertise and careful design, especially for multispectral remote sensing data. A possible but scarcely studied way to circumvent these limitations is to use a pretraining strategy based on masked image modelling. In this paper, we introduce Fus-MAE, a self-supervised learning framework based on masked autoencoders that uses cross-attention to perform early and feature-level data fusion between synthetic aperture radar and multispectral optical data, two modalities with a significant domain gap. Our empirical findings demonstrate that Fus-MAE can effectively compete with contrastive learning strategies tailored for SAR-optical data fusion and outperforms other masked-autoencoder frameworks trained on a larger corpus.
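To make the fusion mechanism concrete, the following is a minimal sketch of single-head cross-attention in which tokens from one modality attend to tokens from the other. It is an illustration under stated assumptions, not the Fus-MAE implementation: the token counts, embedding size, random weights, and all function names are hypothetical, and components a real transformer block would include (multiple heads, layer normalization, residual connections, MLP) are omitted.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper).

    Queries come from one modality; keys and values come from the other,
    so each output token is a weighted mix of the other modality's tokens.
    """
    q = matmul(query_tokens, w_q)
    k = matmul(context_tokens, w_k)
    v = matmul(context_tokens, w_v)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose keys
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rows, cols, rng):
    """Random matrix standing in for learned weights or patch embeddings."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# Assumed toy sizes: 4 SAR tokens, 6 optical tokens, embedding dimension 8.
rng = random.Random(0)
dim = 8
sar_tokens = random_matrix(4, dim, rng)
optical_tokens = random_matrix(6, dim, rng)
w_q, w_k, w_v = (random_matrix(dim, dim, rng) for _ in range(3))

# SAR tokens query the optical tokens; swapping the arguments gives the reverse direction.
fused, weights = cross_attention(sar_tokens, optical_tokens, w_q, w_k, w_v)
```

Here `fused` has one row per SAR token, and each row of `weights` is a probability distribution over the optical tokens. In a masked-autoencoder setting, the query side would typically hold only the visible (unmasked) tokens, though how Fus-MAE wires this into its encoder and decoder is not specified in this section.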
