CtxMIM: Context-Enhanced Masked Image Modeling for Remote Sensing Image Understanding

Mingming Zhang, Qingjie Liu, Yunhong Wang

TL;DR

CtxMIM tackles the difficulty of self-supervised remote sensing learning in dense scenes by introducing a context-enhanced masked image modeling framework. It uses a two-branch Siamese design with a shared Swin encoder, where a context-consistency objective $\mathcal{L}_{Cc}$ guides a reconstructive branch to infer meaningful context from patches, yielding a total objective $\mathcal{L} = \mathcal{L}_{Re} + \mathcal{L}_{Pr} + \mathcal{L}_{Cc}$. Pretrained on a large unlabeled RS corpus of over 1.28 million images, CtxMIM demonstrates strong transfer to land cover classification, semantic segmentation, object detection, and instance segmentation, outperforming both supervised and state-of-the-art SSL baselines. This framework highlights the value of context-aware reconstruction for RS data and offers scalable, generalizable representations for diverse RS tasks.

Abstract

Learning representations through self-supervision on unlabeled data has proven highly effective for understanding diverse images. However, remote sensing images often have complex and densely populated scenes with multiple land objects and no clear foreground objects. This intrinsic property generates high object density, resulting in false positive pairs or missing contextual information in self-supervised learning. To address these problems, we propose a context-enhanced masked image modeling method (CtxMIM), a simple yet efficient MIM-based self-supervised learning method for remote sensing image understanding. CtxMIM formulates original image patches as a reconstructive template and employs a Siamese framework to operate on two sets of image patches. A context-enhanced generative branch is introduced to provide contextual information through context consistency constraints in the reconstruction. With this simple and elegant design, CtxMIM encourages the pre-training model to learn object-level or pixel-level features on a large-scale dataset without specific temporal or geographical constraints. Finally, extensive experiments show that features learned by CtxMIM outperform fully supervised and state-of-the-art self-supervised learning methods on various downstream tasks, including land cover classification, semantic segmentation, object detection, and instance segmentation. These results demonstrate that CtxMIM learns impressive remote sensing representations with high generalization and transferability. Code and data will be made publicly available.
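
The TL;DR gives the total objective as an unweighted sum of three terms. The sketch below only illustrates that sum on toy data. This page names the terms but does not define them, so everything else is an assumption: `masked_l1`, `consistency_mse`, and `perturb` are hypothetical stand-ins, and the choice of which stand-in plays the role of which term is for illustration only, not the authors' formulation.

```python
import random

def masked_l1(pred, target, mask):
    """Stand-in loss (assumption): mean absolute error over masked patches only.

    pred, target: lists of patches, each a list of floats; mask: 1 = masked.
    """
    total, count = 0.0, 0
    for p_row, t_row, m in zip(pred, target, mask):
        if m:
            total += sum(abs(p - t) for p, t in zip(p_row, t_row)) / len(p_row)
            count += 1
    return total / count if count else 0.0

def consistency_mse(out_a, out_b):
    """Stand-in consistency term (assumption): mean squared difference."""
    sq = [(a - b) ** 2 for ra, rb in zip(out_a, out_b) for a, b in zip(ra, rb)]
    return sum(sq) / len(sq)

def total_objective(l_re, l_pr, l_cc):
    """Unweighted sum L = L_Re + L_Pr + L_Cc, as stated in the TL;DR."""
    return l_re + l_pr + l_cc

def perturb(patches, scale, rng):
    """Toy replacement for a network output: the target plus Gaussian noise."""
    return [[v + scale * rng.gauss(0, 1) for v in row] for row in patches]

rng = random.Random(0)
num_patches, patch_dim = 16, 8  # hypothetical toy sizes
target = [[rng.gauss(0, 1) for _ in range(patch_dim)] for _ in range(num_patches)]
mask = [1 if rng.random() < 0.75 else 0 for _ in range(num_patches)]

# Toy outputs standing in for the two branches of the Siamese design.
out_masked = perturb(target, 0.10, rng)   # branch that sees masked patches
out_context = perturb(target, 0.05, rng)  # branch that sees original patches

l_re = masked_l1(out_context, target, [1] * num_patches)
l_pr = masked_l1(out_masked, target, mask)
l_cc = consistency_mse(out_masked, out_context)
loss = total_objective(l_re, l_pr, l_cc)
print(f"total loss: {loss:.4f}")
```

Because the sum is unweighted, no balancing hyperparameters appear here. Whether the actual training code weights the terms is not stated on this page.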

Paper Structure

This paper contains 19 sections, 5 equations, 6 figures, and 4 tables.

Figures (6)

  • Figure 1: Object density gap between natural and remote sensing images. Natural images have apparent foreground objects against a relatively simple background, whereas remote sensing images contain multiple objects in a vast and complicated scene, with no apparent foreground object. Hence, extending MIM to remote sensing images misses contextual information during reconstruction because of the considerably higher object density.
  • Figure 2: Reconstruction results on remote sensing images. The first column shows the raw images. Each remaining triple shows the masked image at a given masking ratio (left), the SimMIM [xie2022simmim] reconstruction (middle), and our CtxMIM reconstruction (right). Both SimMIM and CtxMIM are pretrained on our collected dataset. Although more land objects are masked as the ratio increases, CtxMIM still reconstructs the masked images better in terms of texture and content.
  • Figure 3: High object density in remote sensing images, resulting in mismatched positive pairs in contrastive learning or missing contextual information in reconstructive learning.
  • Figure 4: An illustration of CtxMIM, a simple yet efficient pretraining framework for remote sensing tasks. CtxMIM introduces a novel context-enhanced generative branch, which takes the original image patches as the reconstructive template and provides contextual information through the context consistency constraint ($\mathcal{L}_{Cc}$) during reconstruction. CtxMIM learns highly generalizable and transferable representations for various downstream tasks (e.g., image-level, object-level, and pixel-level).
  • Figure 5: Some CtxMIM reconstruction samples from our pretraining dataset. Each row shows the ground truth followed by four pairs (masked image, CtxMIM reconstruction) at masking ratios of 70%, 75%, 80%, and 85%, from left to right.
  • ...and 1 more figures
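
To make the masking ratios in the Figure 5 caption concrete, here is a minimal sketch of random patch masking at 70%, 75%, 80%, and 85%. The 10x10 patch grid, the uniform random sampling, and the helper name `random_patch_mask` are assumptions for illustration. This page does not say how CtxMIM selects masked patches.

```python
import random

def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Return 0/1 flags with round(num_patches * mask_ratio) entries set to 1.

    Hypothetical helper: patches are chosen uniformly at random.
    """
    num_masked = round(num_patches * mask_ratio)
    rng = random.Random(seed)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [1 if i in masked else 0 for i in range(num_patches)]

num_patches = 100  # hypothetical 10x10 grid of patches
for ratio in (0.70, 0.75, 0.80, 0.85):
    mask = random_patch_mask(num_patches, ratio)
    visible = num_patches - sum(mask)
    print(f"ratio {ratio:.2f}: {sum(mask)} masked, {visible} visible")
```

At the highest ratio only a small fraction of patches stays visible, which is the regime where the caption of Figure 2 reports the largest gap in reconstruction quality.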