FlexiMo: A Flexible Remote Sensing Foundation Model

Xuyang Li, Chenyu Li, Pedram Ghamisi, Danfeng Hong

TL;DR

FlexiMo addresses the fixed-resolution limitation of remote sensing foundation models by introducing a resolution-aware, parameter-free patch embedding alignment (PI-Resize) and a wavelength-guided channel adaptation mechanism. The approach dynamically adapts to arbitrary input resolutions and channel counts without changing the backbone, enabling efficient fine-tuning of ViT-based RSFMs across RGB, multispectral, and SAR data. Comprehensive experiments demonstrate state-of-the-art or competitive performance on image- and pixel-level tasks across diverse datasets, with extensive ablations validating robustness to varying image sizes, patch sizes, and channel configurations. This work significantly enhances the practical deployment of foundation models in real-world, multi-source remote sensing pipelines by improving spatial adaptability and spectral fidelity.
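
The patch embedding alignment named above can be illustrated with a small NumPy sketch. It assumes PI-Resize follows the pseudo-inverse resize formulation known from FlexiViT: if B is the linear map that resizes a flattened patch, the new kernel is pinv(Bᵀ)·w, so that an upsampled patch produces the same token as the original patch did under the original kernel. The function names `linear_resize_matrix_1d` and `pi_resize`, the choice of linear interpolation, and the tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import numpy as np


def linear_resize_matrix_1d(src: int, dst: int) -> np.ndarray:
    """Return a (dst, src) matrix that linearly interpolates a length-src signal.

    Hypothetical helper: half-pixel-centre sampling with edge clamping.
    """
    mat = np.zeros((dst, src))
    pos = (np.arange(dst) + 0.5) * src / dst - 0.5  # sample positions in source coords
    pos = np.clip(pos, 0, src - 1)
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, src - 1)
    frac = pos - lo
    for i in range(dst):
        mat[i, lo[i]] += 1.0 - frac[i]
        mat[i, hi[i]] += frac[i]
    return mat


def pi_resize(weight: np.ndarray, new_patch: int) -> np.ndarray:
    """Resize patch-embedding kernels without adding any parameters.

    weight: (embed_dim, in_channels, p, p) -> (embed_dim, in_channels, new_patch, new_patch)
    """
    embed_dim, in_ch, p, _ = weight.shape
    r = linear_resize_matrix_1d(p, new_patch)
    # For row-major flattening, vec(r @ X @ r.T) == kron(r, r) @ vec(X).
    resize = np.kron(r, r)                  # (new_patch**2, p**2)
    solve = np.linalg.pinv(resize.T)        # (new_patch**2, p**2)
    flat = weight.reshape(embed_dim * in_ch, p * p)
    new_flat = flat @ solve.T               # each kernel w -> pinv(B^T) @ w
    return new_flat.reshape(embed_dim, in_ch, new_patch, new_patch)
```

For example, `pi_resize(w, 8)` applied to kernels `w` of shape `(8, 3, 4, 4)` yields kernels of shape `(8, 3, 8, 8)`. When the patch size grows, the resize matrix has full column rank, so tokens computed from upsampled patches match the original tokens up to floating-point error. When the patch size shrinks, the result is only a least-squares approximation.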

Abstract

The rapid expansion of multi-source satellite imagery drives innovation in Earth observation, opening unprecedented opportunities for Remote Sensing Foundation Models to harness diverse data. However, many existing models remain constrained by fixed spatial resolutions and patch sizes, limiting their ability to fully exploit the heterogeneous spatial characteristics inherent in satellite imagery. To address these challenges, we propose FlexiMo, a flexible remote sensing foundation model that endows the pre-trained model with the flexibility to adapt to arbitrary spatial resolutions. Central to FlexiMo is a spatial resolution-aware module that employs a parameter-free alignment embedding mechanism to dynamically recalibrate patch embeddings based on the input image's resolution and dimensions. This design not only preserves critical token characteristics and ensures multi-scale feature fidelity but also enables efficient feature extraction without requiring modifications to the underlying network architecture. In addition, FlexiMo incorporates a lightweight channel adaptation module that leverages prior spectral information from sensors. This mechanism allows the model to process images with varying numbers of channels while maintaining the data's intrinsic physical properties. Extensive experiments on diverse multimodal, multi-resolution, and multi-scale datasets demonstrate that FlexiMo significantly enhances model generalization and robustness. In particular, our method achieves outstanding performance across a range of downstream tasks, including scene classification, land cover classification, urban building segmentation, and cloud detection. By enabling parameter-efficient and physically consistent adaptation, FlexiMo paves the way for more adaptable and effective foundation models in real-world remote sensing applications.
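
The channel adaptation idea can be sketched in the same spirit. The abstract only states that the module is lightweight and uses prior spectral information from sensors, so everything concrete below is an assumption made for illustration: the sinusoidal wavelength encoding, the single linear generator, the class name `WavelengthWeightGenerator`, and the random matrix standing in for learned parameters. The point is only to show how one generator can emit patch-embedding kernels for any number of input channels.

```python
import numpy as np


def wavelength_encoding(wavelengths_um, dim: int) -> np.ndarray:
    """Sinusoidal encoding of central wavelengths; returns shape (channels, dim).

    Hypothetical encoding; dim must be even.
    """
    wl_nm = np.asarray(wavelengths_um, dtype=float)[:, None] * 1000.0
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    angles = wl_nm * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


class WavelengthWeightGenerator:
    """Maps each channel's wavelength to its own patch-embedding kernel."""

    def __init__(self, enc_dim: int = 32, embed_dim: int = 16, patch: int = 4, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.enc_dim, self.embed_dim, self.patch = enc_dim, embed_dim, patch
        # Stand-in for parameters that would be learned during training.
        self.proj = rng.normal(0.0, 0.02, size=(enc_dim, embed_dim * patch * patch))

    def __call__(self, wavelengths_um) -> np.ndarray:
        enc = wavelength_encoding(wavelengths_um, self.enc_dim)   # (C, enc_dim)
        kernels = enc @ self.proj                                 # (C, embed_dim * p * p)
        channels = kernels.shape[0]
        kernels = kernels.reshape(channels, self.embed_dim, self.patch, self.patch)
        return kernels.transpose(1, 0, 2, 3)                      # (embed_dim, C, p, p)


gen = WavelengthWeightGenerator()
# Approximate central wavelengths in micrometres, chosen for illustration only.
rgb_kernels = gen([0.665, 0.560, 0.490])
ms_kernels = gen([0.665, 0.560, 0.490, 0.842])
```

Because each kernel depends only on its own channel's wavelength, the red-band kernel is identical in both calls, and adding a band does not require changing the backbone or the generator's parameters.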

Paper Structure

This paper contains 26 sections, 17 equations, 8 figures, and 10 tables.

Figures (8)

  • Figure 1: Image Quality and Detail Across Various Spatial Resolutions: Each column presents an image's spatial layout, its frequency spectrum, and the corresponding local information entropy heatmap.
  • Figure 2: An illustration of the proposed FlexiMo. FlexiMo can adjust the patch embedding weights of a pretrained model based on the resolution and image size of the input, enabling better adaptation to the data distribution of downstream tasks.
  • Figure 3: Illustration of the wavelength-guided dynamic weight generation process.
  • Figure 4: Comparison of segmentation visualizations from models employing different patch sizes on the WHU-Building dataset. As the patch size decreases, the model extracts building edges more effectively.
  • Figure 5: Comparison of segmentation visualizations from models employing different patch sizes on the GF12MS-WHU-GF1 dataset. As the patch size decreases, the model shows reduced misclassification and omission rates, thereby enhancing its ability to capture fine-grained details in the images.
  • ...and 3 more figures