
In-Domain Self-Supervised Learning Improves Remote Sensing Image Scene Classification

Ivica Dimitrovski, Ivan Kitanovski, Nikola Simidjievski, Dragi Kocev

TL;DR

This work addresses the challenge of limited labeled data in remote sensing by evaluating in-domain self-supervised pre-training with the iBOT framework and a Vision Transformer backbone trained on the Million-AID dataset. The study benchmarks this approach against ImageNet-based pre-training (both supervised and SSL) and against training from scratch on 14 downstream image scene-classification datasets, covering 9 multi-class classification (MCC) and 5 multi-label classification (MLC) tasks. In-domain SSL pre-training yields consistent downstream gains: improvements of roughly $1\%$ over ImageNet supervised pre-training and up to $2\%$ over ImageNet SSL pre-training, and substantially larger gains when fine-tuning (approximately $16.15\%$ for MCC and $9.54\%$ for MLC on average). The results underscore the importance of aligning the pre-training data domain with the target tasks. The work also offers reproducible experimental workflows, suggests practical benefits for remote sensing analysis, and motivates further ablations and explorations of other SSL frameworks.

Abstract

We investigate the utility of in-domain self-supervised pre-training of vision models in the analysis of remote sensing imagery. Self-supervised learning (SSL) has emerged as a promising approach for remote sensing image classification due to its ability to exploit large amounts of unlabeled data. Unlike traditional supervised learning, SSL aims to learn representations of data without the need for explicit labels. This is achieved by formulating auxiliary tasks that can be used for pre-training models before fine-tuning them on a given downstream task. A common approach in practice to SSL pre-training is utilizing standard pre-training datasets, such as ImageNet. While relevant, such a general approach can have a sub-optimal influence on the downstream performance of models, especially on tasks from challenging domains such as remote sensing. In this paper, we analyze the effectiveness of SSL pre-training by employing the iBOT framework coupled with Vision transformers trained on Million-AID, a large and unlabeled remote sensing dataset. We present a comprehensive study of different self-supervised pre-training strategies and evaluate their effect across 14 downstream datasets with diverse properties. Our results demonstrate that leveraging large in-domain datasets for self-supervised pre-training consistently leads to improved predictive downstream performance, compared to the standard approaches found in practice.
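
The auxiliary tasks mentioned in the abstract can be illustrated with a toy version of patch masking, the general idea behind masked image modeling. The sketch below is not the iBOT procedure itself: it assumes a 224-pixel square input, a mask ratio of 0.3, and uniformly random masking, all chosen only for illustration, and it omits the teacher-student objective entirely. The function names `patch_grid` and `random_patch_mask` are hypothetical. The only value taken from the text is the patch size of 16 implied by ViT-B/16.

```python
import random
from typing import List


def patch_grid(image_size: int, patch_size: int) -> int:
    """Return the number of patches along one side of a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    return image_size // patch_size


def random_patch_mask(num_patches: int, mask_ratio: float, seed: int = 0) -> List[bool]:
    """Return a boolean mask where True marks a patch hidden from the model.

    Uniform random masking is an illustrative assumption, not the
    masking scheme of any particular framework.
    """
    rng = random.Random(seed)
    num_masked = int(num_patches * mask_ratio)
    masked = set(rng.sample(range(num_patches), num_masked))
    return [i in masked for i in range(num_patches)]


# Assumed input resolution of 224 pixels; patch size 16 as in ViT-B/16.
side = patch_grid(image_size=224, patch_size=16)
mask = random_patch_mask(side * side, mask_ratio=0.3)
print(f"{side}x{side} patches, {sum(mask)} masked")
```

During pre-training, a model would be asked to recover information about the hidden patches from the visible ones, which requires no labels.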

Paper Structure (10 sections, 2 figures, 1 table)

Figures (2)

  • Figure 1: Illustration of the experimental pipeline. (Top) An in-domain self-supervised pre-training strategy: the self-supervised iBOT framework (with a ViT-B/16 backbone) is applied to the unlabeled Million-AID dataset. (Bottom) A standard pre-training strategy with ImageNet-1K: models (ViT-B/16) are pre-trained on ImageNet either in a fully supervised manner or with the iBOT framework (with a ViT-B/16 backbone) for self-supervised pre-training. All models are then fine-tuned on each of the downstream datasets separately. We evaluate and report the predictive performance of the resulting fine-tuned models.
  • Figure 2: GradCAM visualizations for the predictions made with fine-tuned models on images sampled from the MLC datasets ('football field' from MLRSNet, 'ship' from AID mlc, and 'tanks' from UC Merced mlc): (first row) Input images with their ground-truth label, (second row) corresponding activation maps from the SSL pre-trained model, (third row) corresponding activation maps from the supervised pre-trained model.
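
The downstream datasets fall into multi-class (MCC) and multi-label (MLC) tasks, which differ in how a fine-tuned model's outputs are turned into predictions. The sketch below shows one common way to do this, assuming the model emits one raw score (logit) per class: MCC picks the single most probable class via softmax, while MLC applies an independent sigmoid to each class and keeps every label above a threshold. The function names and the 0.5 threshold are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert logits to probabilities that sum to 1 (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x: float) -> float:
    """Map a single logit to a probability in (0, 1) without overflow."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def predict_mcc(logits: List[float]) -> int:
    """Multi-class: return the index of the single most probable class."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def predict_mlc(logits: List[float], threshold: float = 0.5) -> List[int]:
    """Multi-label: return every class index whose probability meets the threshold."""
    return [i for i, x in enumerate(logits) if sigmoid(x) >= threshold]


# Made-up logits for three hypothetical classes.
example_logits = [0.2, 2.0, -1.0]
print(predict_mcc(example_logits))
print(predict_mlc(example_logits))
```

With the same logits, the multi-class rule returns exactly one class, whereas the multi-label rule can return several classes or none. This is why an image can carry labels such as 'ship' and others at once in the MLC datasets.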