SeaMo: A Season-Aware Multimodal Foundation Model for Remote Sensing

Xuyang Li, Chenyu Li, Gemine Vivone, Danfeng Hong

TL;DR

SeaMo tackles the challenge of learning robust, season-aware representations from multimodal remote sensing data by integrating unaligned spatial region sampling, a unified multimodal encoder, and a Temporal-Multimodal fusion block within a masked image modeling framework. The model is trained progressively, first on single-time-point multimodal data and then with multi-season flow through TM blocks, using the SSL4EO-S12 dataset. Empirical results across optical and radar downstream tasks show SeaMo achieving state-of-the-art performance under limited fine-tuning, with strong generalization to segmentation and change-detection tasks, validating the effectiveness of explicit seasonality and multimodal fusion in RS foundation models. The work advances RS foundational modeling by demonstrating that temporal dynamics and multi-source data can be cohesively fused to improve Earth observation applications while providing detailed ablations and generalization analyses. As data scale grows, SeaMo’s framework offers a scalable blueprint for season-aware, multimodal RS foundation models with practical implications for geoscientific analysis.

Abstract

Remote Sensing (RS) data encapsulates rich multi-dimensional information essential for Earth observation. Its vast volume, diverse sources, and temporal continuity make it particularly well-suited for developing large Visual Foundation Models (VFMs). These models serve as powerful feature extractors, leveraging extensive RS data for pretraining and subsequent fine-tuning in various geoscientific applications. However, existing VFMs in the RS domain often concentrate on specific image characteristics, neglecting the full season-aware potential of RS data. To bridge this gap, we introduce SeaMo, a novel VFM that effectively integrates multimodal and multi-seasonal RS information. SeaMo leverages a masked image modeling framework to fully exploit the spatial, spectral, and seasonal dimensions of RS data. Specifically, we employ unaligned spatial region selection to capture spatial heterogeneity, incorporate multi-source inputs for enhanced multimodal integration, and introduce temporal-multimodal fusion blocks to assimilate seasonal variations effectively. By explicitly modeling the complex, season-dependent attributes of RS data, SeaMo enhances generalization, robustness, and adaptability across geoscientific tasks. Extensive experiments and ablation studies demonstrate its superior performance, underscoring its potential as a foundational model for Earth observation.
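The masked image modeling step mentioned in the abstract can be illustrated with a minimal sketch of random token masking, in which only the visible tokens would be passed to an encoder. This is not the authors' implementation: the function name `random_mask`, the 75% mask ratio, and the token shape are assumptions chosen for illustration.

```python
import numpy as np

def random_mask(tokens, mask_ratio=0.75, rng=None):
    """Split a token sequence into visible tokens and a boolean mask.

    tokens: array of shape (num_tokens, dim).
    Returns (visible_tokens, visible_idx, mask), where mask[i] is True
    for tokens that are hidden from the encoder.
    The 0.75 default is an assumed value, not taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_tokens = tokens.shape[0]
    num_visible = int(round(num_tokens * (1.0 - mask_ratio)))
    # Shuffle token indices and keep the first num_visible as visible.
    perm = rng.permutation(num_tokens)
    visible_idx = np.sort(perm[:num_visible])
    mask = np.ones(num_tokens, dtype=bool)
    mask[visible_idx] = False
    return tokens[visible_idx], visible_idx, mask

# Hypothetical input: 196 patch tokens (a 14 x 14 grid) of width 64.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))
visible, visible_idx, mask = random_mask(tokens, mask_ratio=0.75, rng=rng)
```

In a masked autoencoder setup, a decoder would then be trained to reconstruct the patches where `mask` is True from the encoded visible tokens.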

Paper Structure (38 sections, 1 equation, 15 figures, 16 tables)

Figures (15)

  • Figure 1: Pretraining workflow of the SeaMo foundation model. The SeaMo architecture integrates three primary components: encoders, Temporal-Multimodal fusion blocks (TM blocks), and decoders. Our approach uses a partially overlapping spatial selection strategy: images from the same temporal instance are cropped identically across modalities, while images from different instances overlap only partially. The cropped images then serve as inputs to the network. Following the masked autoencoder paradigm, only visible tokens are processed by the encoder. The TM block merges features from multiple seasons and modalities, and modality-specific decoders then reconstruct the masked regions of the images.
  • Figure 2: Different region selection strategies for temporal data. The solid boxes indicate the image regions that are selected and fed into the network. (a) Images from different seasons are selected from the same section. (b) Images from different seasons are selected based on a specific proportion of the full image, ensuring partial overlap. (c) Images from different seasons are selected with no overlap.
  • Figure 3: An illustration of the Temporal-Multimodal (TM) block. In this block, data from each modality not only participate in fusion interactions during the current season but also influence the fusion process in subsequent seasons. For clarity, the symbols in the figure are defined as follows: $k$ denotes the key vector, $q$ denotes the query vector, and $v$ denotes the value vector; $CA$ represents cross-attention; and $f$ indicates the fully connected layer.
  • Figure 4: Three distinct multimodal pretraining strategies. Left: The MIM model shares weights but lacks interaction across modalities. Middle: Data from two modalities are concatenated and then fed into the MIM model. Right: A time series of multimodal images is concatenated (modal-concat) and processed by the MIM model, followed by a TM block to strengthen representation learning.
  • Figure 5: Sample visualization of the SSL4EO-S12 dataset. Odd-numbered rows represent Sentinel-2 (multi-spectral), and even-numbered rows represent Sentinel-1 (SAR). Each location is documented with four seasonal snapshots.
  • ...and 10 more figures
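
The partially overlapping region selection described in the captions of Figures 1 and 2(b) can be sketched as follows. This is a hedged illustration, not the paper's code: the function names `sample_season_crops` and `crop`, the `max_shift` parameter, and all image and crop sizes are assumptions. The sketch draws one crop origin per season, bounded so that any two seasons overlap, and applies the same origin to both modalities within a season.

```python
import numpy as np

def sample_season_crops(image_size, crop_size, num_seasons, max_shift, rng=None):
    """Return one (top, left) crop origin per season, shape (num_seasons, 2).

    Each origin is a shared base origin plus a per-season shift in
    [0, max_shift] on each axis, so two crops overlap by at least
    crop_size - max_shift pixels along each axis.
    """
    if not 0 <= max_shift < crop_size:
        raise ValueError("max_shift must be in [0, crop_size)")
    if crop_size + max_shift > image_size:
        raise ValueError("crop_size + max_shift must fit inside the image")
    rng = np.random.default_rng() if rng is None else rng
    base = rng.integers(0, image_size - crop_size - max_shift + 1, size=2)
    shifts = rng.integers(0, max_shift + 1, size=(num_seasons, 2))
    return base + shifts

def crop(image, origin, crop_size):
    """Crop the last two axes of a (channels, H, W) array."""
    top, left = origin
    return image[..., top:top + crop_size, left:left + crop_size]

# Hypothetical data: 4 seasons; band counts and sizes are placeholders.
rng = np.random.default_rng(0)
optical = rng.random((4, 13, 128, 128))  # stand-in for multi-spectral input
radar = rng.random((4, 2, 128, 128))     # stand-in for SAR input

origins = sample_season_crops(128, 96, num_seasons=4, max_shift=32, rng=rng)
# The same origin is used for both modalities within a season.
optical_crops = np.stack([crop(optical[t], origins[t], 96) for t in range(4)])
radar_crops = np.stack([crop(radar[t], origins[t], 96) for t in range(4)])
```

Setting `max_shift=0` in this sketch reproduces the fully aligned case of Figure 2(a). The non-overlapping case of Figure 2(c) would need a different sampling rule.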