SkySense: A Multi-Modal Remote Sensing Foundation Model Towards Universal Interpretation for Earth Observation Imagery

Xin Guo, Jiangwei Lao, Bo Dang, Yingying Zhang, Lei Yu, Lixiang Ru, Liheng Zhong, Ziyuan Huang, Kang Wu, Dingxiang Hu, Huimei He, Jian Wang, Jingdong Chen, Ming Yang, Yongjun Zhang, Yansheng Li

TL;DR

SkySense addresses the need for a universal Remote Sensing Foundation Model (RSFM) by jointly modeling multi-modal temporal Remote Sensing Imagery (RSI) and geo-context. It introduces a Factorized Multi-Modal Spatiotemporal Encoder, Multi-Granularity Contrastive Learning, and Geo-Context Prototype Learning, and is pre-trained on $21.5$M temporal sequences to yield a $2.06$B-parameter model. Across $16$ datasets and $7$ tasks, SkySense achieves state-of-the-art results and outperforms $18$ RSFMs in both single- and multi-modal scenarios. Its modules can be combined or used individually for downstream tasks, and the authors intend to release the pre-trained weights publicly to accelerate Earth observation research.

Abstract

Prior studies on Remote Sensing Foundation Models (RSFMs) reveal immense potential towards a generic model for Earth Observation. Nevertheless, these works primarily focus on a single modality without temporal and geo-context modeling, hampering their capabilities for diverse tasks. In this study, we present SkySense, a generic billion-scale model, pre-trained on a curated multi-modal Remote Sensing Imagery (RSI) dataset with 21.5 million temporal sequences. SkySense incorporates a factorized multi-modal spatiotemporal encoder taking temporal sequences of optical and Synthetic Aperture Radar (SAR) data as input. This encoder is pre-trained by our proposed Multi-Granularity Contrastive Learning to learn representations across different modal and spatial granularities. To further enhance the RSI representations with geo-context clues, we introduce Geo-Context Prototype Learning to learn region-aware prototypes upon RSI's multi-modal spatiotemporal features. To the best of our knowledge, SkySense is the largest Multi-Modal RSFM to date, whose modules can be flexibly combined or used individually to accommodate various tasks. It demonstrates remarkable generalization capabilities on a thorough evaluation encompassing 16 datasets over 7 tasks, from single- to multi-modal, static to temporal, and classification to localization. SkySense surpasses 18 recent RSFMs in all test scenarios. Specifically, it outperforms the latest models such as GFM, SatLas and Scale-MAE by a large margin, i.e., 2.76%, 3.67% and 3.61% on average, respectively. We will release the pre-trained weights to facilitate future research and Earth Observation applications.


Paper Structure

This paper contains 26 sections, 14 equations, 16 figures, 14 tables.

Figures (16)

  • Figure 1: SkySense has achieved superior performance on 16 datasets over 7 distinct tasks compared with 18 state-of-the-art RSFMs and supports a broad range of EO imagery interpretations.
  • Figure 2: The overview of our SkySense model architecture.
  • Figure 3: Overview of SkySense pre-training and downstream usage. SkySense applies data augmentations to the input and then feeds the augmented data into the student and teacher networks, respectively. Multi-Granularity Contrastive Learning and Cross-Modal Alignment are proposed to pre-train the overall network. The region-specific prototype set $\mathcal{P}$ is learned on the student branch and is frozen for downstream usage. Enhancing features with $\mathcal{P}$ is optional. After pre-training, we adopt the parameters of the teacher branch for downstream tasks. Each pre-trained module can be used alone or combined with the others, with the chosen ones either frozen or fine-tuned.
  • Figure 4: (a) Experiment on fine-tuning using different percentages of training data on the AID dataset. (b) The impact of S1-Ts data under varying cloud coverage conditions.
  • Figure 5: Comparison between (a) ESRI LandCover Map and (b) Geo-Context Prototype. The visualization process of Geo-Context Prototype is illustrated in the upper part of this figure.
  • ...and 11 more figures