Interactive Masked Image Modeling for Multimodal Object Detection in Remote Sensing

Minh-Duc Vu, Zuheng Ming, Fangchen Feng, Bissmella Bahaduri, Anissa Mokraoui

TL;DR

This work tackles remote-sensing object detection under data scarcity by introducing interactive masked image modeling (MIM) with cross-attention, enabling rich interactions between masked and unmasked tokens during self-supervised pre-training. By combining RGB and IR modalities and pre-training on large unlabeled multimodal RS datasets, the approach yields significant gains in mAP@.5 on VEDAI, with further improvement from data expansion on AVIID. The combination of a Swin Transformer encoder and a cross-attention MIM framework demonstrates strong benefits for detecting small, densely packed objects in diverse terrains, highlighting the value of multimodal SSL for Earth observation tasks. The method shows promise for broader adoption in domains with limited labeled data, offering a scalable path to improved detection performance through self-supervised multimodal pre-training.

Abstract

Object detection in remote sensing imagery plays a vital role in various Earth observation applications. However, unlike object detection in natural scene images, this task is particularly challenging due to the abundance of small, often barely visible objects across diverse terrains. To address these challenges, multimodal learning can be used to integrate features from different data modalities, thereby improving detection accuracy. Nonetheless, the performance of multimodal learning is often constrained by the limited size of labeled datasets. In this paper, we propose to use Masked Image Modeling (MIM) as a pre-training technique, leveraging self-supervised learning on unlabeled data to enhance detection performance. However, conventional MIM such as MAE, which uses masked tokens without any contextual information, struggles to capture fine-grained details due to a lack of interaction with other parts of the image. To address this, we propose a new interactive MIM method that establishes interactions between different tokens, which is particularly beneficial for object detection in remote sensing. Extensive ablation studies and evaluation demonstrate the effectiveness of our approach.

Paper Structure (13 sections, 3 equations, 3 figures, 4 tables)

Figures (3)

  • Figure 1: Interactive masked image modeling for self-supervised pre-training. Top: conventional masked image modeling such as MAE (He et al., 2022). Bottom: interactive masked image modeling, in which a cross-attention module creates interactions between unmasked and masked tokens. The features of unmasked tokens from the encoder (green squares) are merged with the features of masked tokens from the cross-attention module (orange squares) to reconstruct the masked images.
  • Figure 2: Overview of our framework. Our proposed framework consists of two stages: pre-training (top) on the AVIID or DIOR datasets, and fine-tuning (bottom) on VEDAI. During pre-training, the output features of unmasked tokens from the encoder, merged with the output features of masked tokens from the cross-attention module, are used to reconstruct the multimodal images. After pre-training, the decoder is discarded, and the pre-trained encoder serves as the image encoder for fine-tuning on VEDAI.
  • Figure 3: Visual illustration of the effectiveness of the proposed interactive MIM for multimodal object detection in remote sensing images.
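
The token flow described in the Figure 1 and Figure 2 captions can be sketched in a few lines of dependency-free Python. This is only an illustrative sketch, not the authors' implementation: the token count, feature size, mask ratio, the shared placeholder `mask_token`, and the helper names `cross_attend` and `merge_tokens` are all assumptions, random vectors stand in for the Swin Transformer encoder outputs, and the learned query/key/value projections and positional embeddings of a real cross-attention module are omitted.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Single-head scaled dot-product cross-attention (no learned projections).

    Each query (a masked-token embedding) attends over the context
    (features of the unmasked tokens) and returns a weighted average of it.
    """
    dim = len(context[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return outputs


def merge_tokens(num_tokens, visible_idx, visible_feats, masked_idx, masked_feats):
    """Place unmasked and masked features back in the original patch order."""
    merged = [None] * num_tokens
    for i, feat in zip(visible_idx, visible_feats):
        merged[i] = feat
    for i, feat in zip(masked_idx, masked_feats):
        merged[i] = feat
    return merged


random.seed(0)
num_tokens, dim, mask_ratio = 8, 4, 0.5  # hypothetical sizes, not from the paper

# Stand-in for per-patch encoder features (random values for illustration).
patch_feats = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_tokens)]

# Randomly choose which patch positions are masked.
order = list(range(num_tokens))
random.shuffle(order)
num_masked = int(num_tokens * mask_ratio)
masked_idx = sorted(order[:num_masked])
visible_idx = sorted(order[num_masked:])

visible_feats = [patch_feats[i] for i in visible_idx]

# Masked positions start from a shared placeholder embedding (learned in practice).
mask_token = [0.1] * dim
masked_queries = [mask_token[:] for _ in masked_idx]

# Interaction step: masked tokens gather context from the unmasked tokens.
masked_feats = cross_attend(masked_queries, visible_feats)

# Merged sequence that a reconstruction decoder would consume.
decoder_input = merge_tokens(num_tokens, visible_idx, visible_feats, masked_idx, masked_feats)
```

Because the sketch omits positional embeddings, every masked query is identical and so are the resulting masked features; in a real model the positional information is what lets each masked token attend to different context. The point of the sketch is only the data flow: unmasked features pass through unchanged, masked features are produced by attending over them, and the two sets are merged before reconstruction.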