SATIN: A Multi-Task Metadataset for Classifying Satellite Imagery using Vision-Language Models

Jonathan Roberts, Kai Han, Samuel Albanie

TL;DR

SATIN combines 27 remote-sensing (RS) datasets into a six-task metadataset for probing how well zero-shot vision-language (VL) models generalize to diverse satellite imagery. The paper benchmarks a broad range of VL baselines across backbones and pretraining datasets and finds that the strongest method evaluated reaches only 52.0% accuracy. It introduces a standardized evaluation protocol covering multi-label, hierarchical, and false-colour tasks, plus a public leaderboard for tracking progress. The results highlight a substantial gap between natural-image pretraining and RS understanding, while showing that fine-tuning on a small volume of in-domain RS data can yield notable gains.

Abstract

Interpreting remote sensing imagery enables numerous downstream applications ranging from land-use planning to deforestation monitoring. Robustly classifying this data is challenging due to the Earth's geographic diversity. While many distinct satellite and aerial image classification datasets exist, there is yet to be a benchmark curated that suitably covers this diversity. In this work, we introduce SATellite ImageNet (SATIN), a metadataset curated from 27 existing remotely sensed datasets, and comprehensively evaluate the zero-shot transfer classification capabilities of a broad range of vision-language (VL) models on SATIN. We find SATIN to be a challenging benchmark: the strongest method we evaluate achieves a classification accuracy of 52.0%. We provide a public leaderboard (https://satinbenchmark.github.io) to guide and track the progress of VL models in this important domain.
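To make the zero-shot setting concrete, the following is a minimal sketch of how zero-shot classification with a VL model is commonly scored: each class name is embedded as text, each image is embedded, and the prediction is the class whose text embedding has the highest cosine similarity with the image embedding. The embeddings, class names, and helper functions below are hypothetical toy values chosen for illustration; in practice the vectors would come from a model's image and text encoders, and this sketch does not claim to reproduce the paper's exact protocol or prompts.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def zero_shot_classify(image_emb, class_embs):
    """Return (best_label, scores) using cosine similarity to each class embedding."""
    img = l2_normalize(image_emb)
    scores = {}
    for label, text_emb in class_embs.items():
        txt = l2_normalize(text_emb)
        scores[label] = sum(a * b for a, b in zip(img, txt))
    best_label = max(scores, key=scores.get)
    return best_label, scores


def accuracy(predictions, targets):
    """Fraction of predictions that match the ground-truth labels."""
    correct = sum(p == t for p, t in zip(predictions, targets))
    return correct / len(targets)


# Hypothetical text embeddings for three class prompts (assumed, not from the paper).
class_embs = {
    "forest": [1.0, 0.1, 0.0],
    "water": [0.0, 1.0, 0.2],
    "urban": [0.1, 0.0, 1.0],
}

# Hypothetical image embeddings paired with ground-truth labels.
images = [
    ([0.9, 0.2, 0.1], "forest"),
    ([0.1, 0.8, 0.3], "water"),
    ([0.6, 0.5, 0.1], "water"),  # ambiguous example that the toy model gets wrong
]

preds = [zero_shot_classify(emb, class_embs)[0] for emb, _ in images]
acc = accuracy(preds, [label for _, label in images])
print(preds, round(acc, 3))
```

No task-specific training happens here: only the list of class names changes between datasets, which is what allows a single model to be evaluated across many constituent datasets.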

Paper Structure (37 sections, 10 figures, 11 tables, 1 algorithm)

Figures (10)

  • Figure 1: The SATIN taxonomy. We propose the SATIN benchmark containing 27 constituent datasets spanning 6 distinct tasks. The imagery is globally distributed, with resolutions spanning 5 orders of magnitude and over 250 distinct class labels.
  • Figure 2: Example imagery and ground truth labels for each SATIN task. This subset includes imagery from 10 of the 27 constituent datasets and shows the breadth of diversity in SATIN across different imagery types, resolutions, fields of view and geographic areas.
  • Figure 3: SATIN task performance distribution for the models in Table \ref{Table:best_8_results}. A small amount of random horizontal jitter has been added to the scatter points to improve readability. The lowest-scoring purple cross represents the challenging Canadian Cropland dataset.
  • Figure 4: SATIN zero-shot performance for a broad range of VL models. Symbols denote methods, colours denote backbones, and the x-axis shows the number of pretraining images. Models with the same method and backbone are linked with dashed lines; for clarity, insets magnify the crowded regions. In general, we observe increasing SATIN accuracy with the volume of pretraining data. Of the models we benchmark, we find that a ViT-B/32 CLIP model fine-tuned with a small volume of RS data achieves the highest score.
  • Figure 5: Example SATIN imagery and ground truth labels for Task 1: Land Cover (SAT-4 -- NaSC-TG2) and Task 2: Land Use (WHU-RS19 -- EuroSAT).
  • ...and 5 more figures