Exploring Masked Autoencoders for Sensor-Agnostic Image Retrieval in Remote Sensing

Jakob Hackstein; Gencer Sumbul; Kai Norman Clasen; Begüm Demir

TL;DR

This work addresses cross-sensor content-based image retrieval (CBIR) in remote sensing (RS) by introducing Cross-Sensor Masked Autoencoders (CSMAEs), which extend masked image modeling to sensor-agnostic settings. It defines four CSMAE variants that jointly learn intra- and inter-modal representations by masking patches across two sensors and reconstructing both uni-modal and cross-modal patches, with optional inter-modal latent similarity losses to align representations. Experiments on BEN-14K and BEN-270K show that CSMAEs outperform uni-modal MAEs and several baselines for cross-sensor retrieval, while balancing model capacity and training data needs. The study provides practical guidelines for selecting a CSMAE architecture based on data availability and compute, and positions CSMAEs as a step toward scalable, sensor-agnostic RS representation learning with broader applicability.

Abstract

Self-supervised learning through masked autoencoders (MAEs) has recently attracted great attention for remote sensing (RS) image representation learning, and thus holds significant potential for content-based image retrieval (CBIR) from ever-growing RS image archives. However, the existing MAE-based CBIR studies in RS assume that the considered RS images are acquired by a single image sensor, and thus are only suitable for uni-modal CBIR problems. The effectiveness of MAEs for cross-sensor CBIR, which aims to search for semantically similar images across different image modalities, has not yet been explored. In this paper, we take the first step toward exploring the effectiveness of MAEs for sensor-agnostic CBIR in RS. To this end, we present a systematic overview of the possible adaptations of the vanilla MAE to exploit masked image modeling on multi-sensor RS image archives (denoted as cross-sensor masked autoencoders [CSMAEs]) in the context of CBIR. Based on different adjustments applied to the vanilla MAE, we introduce different CSMAE models. We also provide an extensive experimental analysis of these CSMAE models. Finally, we derive a guideline for exploiting masked image modeling for uni-modal and cross-modal CBIR problems in RS. The code of this work is publicly available at https://github.com/jakhac/CSMAE.
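To make the cross-sensor retrieval task concrete, the following is a minimal sketch of ranking archive images from one sensor against a query from another, assuming both have already been mapped to embeddings in a shared latent space. The function names (`cosine`, `retrieve`), the toy embeddings, and the choice of cosine similarity are illustrative assumptions and are not taken from the paper or its code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(query_embedding, archive, k=2):
    """Return the names of the k archive entries most similar to the query.

    archive maps an image name to its embedding. The similarity measure
    (cosine) is an assumption made for this sketch.
    """
    ranked = sorted(
        archive.items(),
        key=lambda item: cosine(query_embedding, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical toy data: one S1 query embedding and three S2 archive embeddings.
s2_archive = {
    "s2_a": [1.0, 0.0],
    "s2_b": [0.0, 1.0],
    "s2_c": [0.9, 0.1],
}
s1_query = [1.0, 0.0]
top_matches = retrieve(s1_query, s2_archive, k=2)
```

In this setting the quality of retrieval depends entirely on how well the encoder aligns embeddings from the two sensors, which is what the inter-modal objectives of CSMAEs are meant to encourage.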

Paper Structure (26 sections, 4 equations, 4 figures, 11 tables)

Introduction
Related Work
Masked Autoencoders in RS
Cross-Sensor CBIR in RS
Cross-Sensor Masked Autoencoders (CSMAE) for Sensor-Agnostic Remote Sensing Image Retrieval
Basics on Masked Autoencoders
Adaptation of MAEs for Sensor-Agnostic Image Retrieval
Adaptation on Image Masking
Adaptation on ViT Architecture
Adaptation on Masked Image Modeling
Data Set Description and Experimental Setup
Data Set Description
Experimental Setup
Experimental Results
Sensitivity Analysis
...and 11 more sections

Figures (4)

Figure 1: An illustration of CSMAEs. During the forward pass, the required steps associated with each RS image modality are shown by using a different color.
Figure 2: An illustration of three different multi-modal masking correspondences: (a) identical; (b) random; and (c) disjoint. For each one, if the same local areas are masked out on images from different sensors, they are shown in green. Otherwise, they are shown in red.
Figure 3: An illustration of four different CSMAE models. A CSMAE is composed of a multi-sensor encoder, a cross-sensor encoder and a multi-sensor decoder, each of which is based on ViTs. For (a) CSMAE-CECD and (b) CSMAE-CESD, the multi-sensor encoder employs a sensor-common encoder for producing the latent representations of the unmasked patches. For (c) CSMAE-SECD and (d) CSMAE-SESD, the multi-sensor encoder employs sensor-specific encoders, where ViT encoders with different parameters are utilized for different image modalities. For (a) CSMAE-CECD and (c) CSMAE-SECD, the multi-sensor decoder employs a sensor-common decoder for reconstruction. For (b) CSMAE-CESD and (d) CSMAE-SESD, the multi-sensor decoder employs sensor-specific decoders, where ViT decoders with different parameters are utilized for different image modalities.
Figure 4: S1$\rightarrow$S2 retrieval results for (a) the S1 query image, (b) the S2 image acquired over the same geographical area as the query image, and S2 images retrieved by using: (c) MAE; (d) MAE-RVSA; (e) SS-CMIR; (f) MaskVLM; (g) CSMAE-CECD; (h) CSMAE-CESD; (i) CSMAE-SECD; and (j) CSMAE-SESD, all of which are trained on BEN-270K.
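The three multi-modal masking correspondences named in the Figure 2 caption (identical, random, disjoint) can be sketched as follows. This is a minimal illustration over patch indices only, assuming two sensors with the same patch grid; the function name `sample_masks`, its parameters, and the reading of "disjoint" as "no patch is masked on both sensors" are assumptions for this sketch rather than the authors' implementation.

```python
import random


def sample_masks(num_patches, mask_ratio, mode, seed=0):
    """Return (masked_s1, masked_s2): sets of masked patch indices per sensor.

    mode is one of "identical", "random", "disjoint".
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    indices = list(range(num_patches))
    masked_s1 = set(rng.sample(indices, num_masked))

    if mode == "identical":
        # The same local areas are masked on both sensors.
        masked_s2 = set(masked_s1)
    elif mode == "random":
        # Each sensor gets an independent mask; overlap happens by chance.
        masked_s2 = set(rng.sample(indices, num_masked))
    elif mode == "disjoint":
        # Patches masked on sensor 1 stay visible on sensor 2 (assumed reading).
        visible_s1 = [i for i in indices if i not in masked_s1]
        if num_masked > len(visible_s1):
            raise ValueError("disjoint masking requires mask_ratio <= 0.5")
        masked_s2 = set(rng.sample(visible_s1, num_masked))
    else:
        raise ValueError(f"unknown masking mode: {mode}")
    return masked_s1, masked_s2


# Example: 16 patches per image, half of them masked on each sensor.
same_s1, same_s2 = sample_masks(16, 0.5, "identical")
rand_s1, rand_s2 = sample_masks(16, 0.5, "random")
disj_s1, disj_s2 = sample_masks(16, 0.5, "disjoint")
```

Under this reading, disjoint masking only works for mask ratios up to 0.5, since every patch hidden on one sensor must remain visible on the other.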
