Any-Optical-Model: A Universal Foundation Model for Optical Remote Sensing

Xuyang Li, Chenyu Li, Danfeng Hong

TL;DR

Remote sensing foundation models struggle when band sets and resolutions vary across sensors. The Any-Optical-Model (AOM) addresses this with a spectrum-independent tokenizer (SiTok), a multi-scale adaptive patch embedding (MAPE), and a dual self-supervised pretraining scheme that masks and reconstructs per-channel features while aligning semantics across scales. Pretrained on 1.56M samples from Sentinel-2, Landsat-8, and high-resolution datasets, AOM achieves state-of-the-art results under missing-band, cross-sensor, and cross-resolution conditions on Geo-Bench and other benchmarks. The work demonstrates robust cross-sensor generalization and resolution-robust feature extraction, moving toward truly universal optical RS foundation models.

Abstract

Optical satellites, with their diverse band layouts and ground sampling distances, supply indispensable evidence for tasks ranging from ecosystem surveillance to emergency response. However, significant discrepancies in band composition and spatial resolution across different optical sensors present major challenges for existing Remote Sensing Foundation Models (RSFMs). These models are typically pretrained on fixed band configurations and resolutions, making them vulnerable to real-world scenarios involving missing bands, cross-sensor fusion, and unseen spatial scales, thereby limiting their generalization and practical deployment. To address these limitations, we propose the Any-Optical-Model (AOM), a universal RSFM explicitly designed to accommodate arbitrary band compositions, sensor types, and resolution scales. To preserve distinctive spectral characteristics even when bands are missing or newly introduced, AOM introduces a spectrum-independent tokenizer that assigns each channel a dedicated band embedding, enabling explicit encoding of spectral identity. To effectively capture texture and contextual patterns from sub-meter to hundred-meter imagery, we design a multi-scale adaptive patch embedding mechanism that dynamically modulates the receptive field. Furthermore, to maintain global semantic consistency across varying resolutions, AOM incorporates a multi-scale semantic alignment mechanism alongside a channel-wise self-supervised masking and reconstruction pretraining strategy that jointly models spectral-spatial relationships. Extensive experiments on over 10 public datasets, including those from Sentinel-2, Landsat, and HLS, demonstrate that AOM consistently achieves state-of-the-art (SOTA) performance under challenging conditions such as missing-band, cross-sensor, and cross-resolution settings.
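The idea of a tokenizer that treats each channel separately and tags it with a dedicated band embedding can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class name `ChannelWiseTokenizer`, the helper `patchify`, the shared linear projection, the random initialization, and all sizes (patch 4, dimension 8, 16x16 inputs) are assumptions chosen only to show why per-band tokens remain valid when only a subset of bands is supplied.

```python
import numpy as np

def patchify(band, patch):
    """Split one (H, W) band into flattened non-overlapping patches: (N, patch*patch)."""
    h, w = band.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of patch"
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch) -> (gh, gw, patch, patch) -> (gh*gw, patch*patch)
    x = band.reshape(gh, patch, gw, patch).transpose(0, 2, 1, 3)
    return x.reshape(gh * gw, patch * patch)

class ChannelWiseTokenizer:
    """Hypothetical per-band tokenizer: shared patch projection + one embedding per band."""

    def __init__(self, num_known_bands, patch=4, dim=8, seed=0):
        rng = np.random.default_rng(seed)
        self.patch = patch
        # Assumption: a single projection shared by all bands (weights are random here).
        self.proj = rng.normal(scale=0.02, size=(patch * patch, dim))
        # One embedding vector per known band, looked up by band index.
        self.band_embed = rng.normal(scale=0.02, size=(num_known_bands, dim))

    def __call__(self, image, band_ids):
        """image: (C, H, W); band_ids: C indices into band_embed. Returns (C, N, dim)."""
        assert image.shape[0] == len(band_ids)
        tokens = []
        for band, bid in zip(image, band_ids):
            t = patchify(band, self.patch) @ self.proj   # (N, dim) patch tokens
            tokens.append(t + self.band_embed[bid])      # tag tokens with band identity
        return np.stack(tokens)

# Usage: tokenize a 13-band image, then an arbitrary three-band subset of it.
tok = ChannelWiseTokenizer(num_known_bands=13)
full = np.random.default_rng(1).normal(size=(13, 16, 16))
all_tokens = tok(full, list(range(13)))
subset_ids = [3, 2, 1]
subset_tokens = tok(full[subset_ids], subset_ids)
```

Because every band is tokenized independently and identified by its own embedding, the tokens produced for a band in the subset call match those produced for the same band in the full call, so the downstream encoder sees a shorter but consistent token sequence when bands are missing.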

Paper Structure

This paper contains 16 sections, 8 equations, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Limitations of current remote sensing foundation models. Mainstream models fail to adapt to varying spectral bands, spatial resolutions, and image sizes across pretraining and downstream tasks.
  • Figure 2: An illustration of the proposed model. AOM unifies spectral and spatial modeling through channel-wise patch embedding, adaptive multi-scale extraction, spectral-wise masking & reconstruction, and multi-scale alignment, enabling flexible band configurations and robust performance across resolutions.
  • Figure 3: Linear probing accuracy (%) on EuroSAT across different band combinations. EuroSAT images are captured by Sentinel-2 and contain 13 spectral bands (indexed 0–12). The y-axis represents classification accuracy, while the x-axis indicates the number of spectral bands used during training.
  • Figure 4: Visualization of feature maps from different Sentinel-2 spectral bands through SiTok. The results demonstrate the model's ability to extract features independently from each band.
  • Figure 5: Results on two datasets across different patch size configurations. AOM maintains stable accuracy and mIoU across patch granularities, demonstrating robust performance in both classification and segmentation.
  • ...and 1 more figure