Self-supervised Learning in Remote Sensing: A Review

Yi Wang, Conrad M Albrecht, Nassim Ait Ali Braham, Lichao Mou, Xiao Xiang Zhu

TL;DR

This work surveys self-supervised learning for remote sensing, organizing methods into generative, predictive, and contrastive families and mapping CV advances to earth observation contexts. It discusses RS-specific data characteristics, proposes a taxonomy of pretext tasks across spatial, spectral, temporal, and multi-sensor contexts, and catalogs a suite of RS SSL applications. A preliminary benchmark on BigEarthNet, SEN12MS, and So2Sat-LCZ42 evaluates four representative contrastive methods, revealing robust RS representations with MoCo-based approaches and highlighting the importance of data augmentation and label efficiency. The authors also identify challenges (e.g., model collapse, augmentation design, multimodal integration) and outline SSL4EO directions to bridge CV and RS communities for scalable, label-free representation learning in Earth observation.

Abstract

In deep learning research, self-supervised learning (SSL) has received great attention, triggering interest within both the computer vision and remote sensing communities. While SSL has seen great success in computer vision, most of its potential in the domain of earth observation remains locked. In this paper, we provide an introduction to, and a review of, the concepts and latest developments in SSL for computer vision in the context of remote sensing. Further, we provide a preliminary benchmark of modern SSL algorithms on popular remote sensing datasets, verifying the potential of SSL in remote sensing and providing an extended study on data augmentations. Finally, we identify a list of promising directions for future research in SSL for earth observation (SSL4EO) to pave the way for fruitful interaction between both domains.

Paper Structure

This paper contains 39 sections, 9 equations, 27 figures, and 2 tables.

Figures (27)

  • Figure 1: The number of recent publications related to self-supervised learning (SSL). While a clear trend of increased efforts to advance SSL is observed, activity in remote sensing lags behind.
  • Figure 2: The general pipeline of self-supervised learning. The visual representation is learned through self-supervision that comes from the unlabeled data. The learned parameters serve as a pre-trained model and are transferred to supervised downstream tasks for fine-tuning.
  • Figure 3: A taxonomy of self-supervised learning.
  • Figure 4: Variational AutoEncoder (VAE) [kingma2013auto]. Instead of encoding the input $x=(x_1,x_2,\dots)$ to a fixed latent vector, a VAE maps it to a multi-dimensional Gaussian distribution with non-zero mean $\mu=(\mu_1,\mu_2,\dots)$ and diagonal covariance matrix $\Sigma=\text{diag}(\sigma_1^2,\sigma_2^2,\dots)$. Reconstruction works by decoding latent vectors $(z_1,z_2,\dots)$ sampled from this distribution.
  • Figure 5: Bidirectional Generative Adversarial Networks (BiGAN) [donahue2016adversarial]. BiGAN includes an encoder $E$ which maps data $x$ to latent representations $z$. The BiGAN discriminator $\mathcal{D}$ acts jointly in data and latent space: $x$ versus $\mathcal{G}(z)$, and $E(x)$ versus $z$, respectively.
  • ...and 22 more figures
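
The sampling step described in the Figure 4 caption can be illustrated with a minimal sketch. It draws a latent vector $z$ from a Gaussian with mean $\mu$ and diagonal covariance $\text{diag}(\sigma_1^2,\sigma_2^2,\dots)$ by shifting and scaling standard normal noise, which is one common way to implement such sampling. The function name `sample_latent` and the example values are hypothetical and not taken from the paper; the encoder and decoder networks are omitted.

```python
import random


def sample_latent(mu, sigma, rng):
    """Draw one latent vector z ~ N(mu, diag(sigma^2)).

    mu    : list of per-dimension means (mu_1, mu_2, ...)
    sigma : list of per-dimension standard deviations (sigma_1, sigma_2, ...)
    rng   : a random.Random instance, passed in for reproducibility
    """
    if len(mu) != len(sigma):
        raise ValueError("mu and sigma must have the same length")
    # z_i = mu_i + sigma_i * eps_i, with eps_i drawn from N(0, 1).
    # With a diagonal covariance, every dimension is sampled independently.
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


# Illustrative values only; in a VAE these would be encoder outputs.
rng = random.Random(0)
mu = [0.5, -1.0, 2.0]
sigma = [0.1, 0.2, 0.3]
z = sample_latent(mu, sigma, rng)
print(len(z))  # one latent value per dimension
```

In a full VAE, a vector such as `z` would then be passed to the decoder to reconstruct the input.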