Pattern Integration and Enhancement Vision Transformer for Self-Supervised Learning in Remote Sensing

Kaixuan Lu, Ruiqian Zhang, Xiao Huang, Yuxing Xie, Xiaogang Ning, Hanchao Zhang, Mengke Yuan, Pan Zhang, Tao Wang, Tongkui Liao

TL;DR

This work presents the pattern integration and enhancement vision transformer (PIEViT), a novel SSL framework designed specifically for RS imagery that achieves excellent results in object detection, land cover classification, and change detection, underscoring its robustness, generalization, and transferability for RS image interpretation tasks.

Abstract

Recent self-supervised learning (SSL) methods have demonstrated impressive results in learning visual representations from unlabeled remote sensing images. However, most remote sensing images consist of scenographic scenes containing multiple ground objects without explicit foreground targets, which limits the performance of existing SSL methods that focus on foreground targets. This raises the question: is there a method that can automatically aggregate similar objects within scenographic remote sensing images, thereby enabling models to differentiate the knowledge embedded in various geospatial patterns and improve feature representation? In this work, we present the Pattern Integration and Enhancement Vision Transformer (PIEViT), a novel self-supervised learning framework designed specifically for remote sensing imagery. PIEViT uses a teacher-student architecture to address both image-level and patch-level tasks. It employs the Geospatial Pattern Cohesion (GPC) module to explore the natural clustering of patches, enhancing the differentiation of individual features. The Feature Integration Projection (FIP) module further refines masked token reconstruction using geospatially clustered patches. We validated PIEViT across multiple downstream tasks, including object detection, semantic segmentation, and change detection. Experiments demonstrated that PIEViT enhances the representation of internal patch features, providing significant improvements over existing self-supervised baselines. It achieves excellent results in object detection, land cover classification, and change detection, underscoring its robustness, generalization, and transferability for remote sensing image interpretation tasks.

Paper Structure

This paper contains 27 sections, 10 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: PIE Network Structure, which performs masked image modeling by self-distillation. Two different views of an image are fed to the teacher network and the student network, respectively. A stop-gradient (SG) operator is applied to the teacher network, so gradients propagate only through the student. The teacher parameters are updated as an exponential moving average (EMA) of the student parameters.
  • Figure 2: Geospatial Pattern Cohesion Score Calculation.
  • Figure 3: The Dual-Stream Feature Learning Framework.
  • Figure 4: Visualization of object detection results on the DIOR test dataset.
  • Figure 5: Visualization of semantic segmentation results on the Potsdam validation dataset.
  • ...and 2 more figures
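
The teacher update described in the Figure 1 caption can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `ema_update`, the dictionary-of-lists parameter layout, and the momentum value 0.996 are assumptions chosen for illustration only. The stop-gradient corresponds here to the fact that the teacher is never trained by backpropagation; its parameters change only through this averaging step.

```python
from typing import Dict, List

# Hypothetical parameter layout: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, momentum: float = 0.996) -> Params:
    """Return new teacher parameters as an EMA of the student parameters.

    The momentum default is an illustrative assumption, not a value
    taken from the paper.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must be in [0, 1]")
    updated: Params = {}
    for name, t_values in teacher.items():
        s_values = student[name]
        if len(s_values) != len(t_values):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # teacher <- momentum * teacher + (1 - momentum) * student
        updated[name] = [
            momentum * t + (1.0 - momentum) * s
            for t, s in zip(t_values, s_values)
        ]
    return updated


# Toy usage: the teacher moves a small step toward the student.
teacher = {"w": [1.0, 1.0]}
student = {"w": [0.0, 2.0]}
teacher = ema_update(teacher, student, momentum=0.9)
```

With a momentum close to 1, the teacher changes slowly and acts as a smoothed copy of the student, which is the usual motivation for EMA teachers in self-distillation.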