Towards Scalable Foundation Model for Multi-modal and Hyperspectral Geospatial Data

Haozhe Si, Yuxuan Wan, Minh Do, Deepak Vasisht, Han Zhao, Hendrik F. Hamann

TL;DR

This work tackles the challenge of building scalable, multi-modal, hyperspectral geospatial foundation models. It introduces LESS ViT, a low-rank spatial–spectral transformer with Hyperspectral Patch Embedding, LESS Attention, and a Perception Field Mask to efficiently model spatial–spectral correlations across arbitrary channel counts and resolutions. Complementing this, Hyper-MAE decouples spatial and spectral masking in a masked autoencoder pretraining objective, and GFM-Bench standardizes evaluation across diverse geospatial tasks. Empirically, LESS ViT achieves competitive results against state-of-the-art baselines and demonstrates superior cross-satellite generalization and efficiency, validating its potential for broad geospatial analysis. The combination of physics-informed embeddings, efficient attention, and robust benchmarking positions this framework as a practical pathway for scalable, multi-modal Earth observation tasks.

Abstract

Geospatial raster data, such as that collected by satellite-based imaging systems at different times and spectral bands, hold immense potential for enabling a wide range of high-impact applications. This potential stems from the rich information that is spatially and temporally contextualized across multiple channels and sensing modalities. Recent work has adapted existing self-supervised learning approaches to such geospatial data. However, these approaches lack scalable model architectures, which leads to inflexibility and computational inefficiency as the number of channels and modalities grows. To address these limitations, we introduce the Low-rank Efficient Spatial-Spectral Vision Transformer (LESS ViT) with three key innovations: i) the LESS Attention Block, which approximates high-dimensional spatial-spectral attention through the Kronecker product of low-dimensional spatial and spectral attention components; ii) the Continuous Positional-Channel Embedding Layer, which preserves both the continuity and the physical characteristics of each spatial-spectral patch; and iii) the Perception Field Mask, which exploits local spatial dependencies by constraining attention to neighboring patches. To evaluate the proposed innovations, we construct GFM-Bench, which serves as a comprehensive benchmark for such geospatial raster data. We pretrain LESS ViT using a Hyperspectral Masked Autoencoder framework with integrated positional and channel masking strategies. Experimental results demonstrate that our proposed method achieves competitive performance against state-of-the-art multi-modal geospatial foundation models while outperforming them on cross-satellite generalization tasks with higher computational efficiency. The flexibility and extensibility of our framework make it a promising direction for future geospatial data analysis tasks that involve a wide range of modalities and channels.
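
The third innovation, constraining attention to neighboring patches, can be pictured with a small sketch. The abstract does not say how the neighborhood is defined, so the version below is an assumption: patches sit on a regular grid, and a patch may attend to another only if their Chebyshev (chessboard) distance is at most a hypothetical `radius`. The function names are illustrative and are not taken from the paper.

```python
import numpy as np

def perception_field_mask(grid_h, grid_w, radius):
    """Boolean (N, N) mask over N = grid_h * grid_w patches.

    Entry [i, j] is True when patch j lies within `radius` grid steps
    of patch i (Chebyshev distance; an assumed neighborhood rule).
    """
    rows, cols = np.divmod(np.arange(grid_h * grid_w), grid_w)
    d_row = np.abs(rows[:, None] - rows[None, :])
    d_col = np.abs(cols[:, None] - cols[None, :])
    return np.maximum(d_row, d_col) <= radius

def masked_softmax(logits, mask):
    """Row-wise softmax that gives zero weight to masked-out positions."""
    masked = np.where(mask, logits, -np.inf)
    # Every row keeps its diagonal entry, so the row maximum is finite.
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)
```

On a 4×4 grid with `radius=1`, a corner patch keeps 4 candidate keys and an interior patch keeps 9, instead of all 16. Passing random logits through `masked_softmax` with this mask yields attention rows that still sum to one but are zero outside the neighborhood.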

Paper Structure

This paper contains 28 sections, 5 equations, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 1: Illustration of Multi-modal Geospatial Data. Two satellite systems capture complementary modalities: optical imagery (left) with multiple spectral bands, and Synthetic Aperture Radar (SAR) imagery (right) with different polarizations. Both modalities capture information across spatial and spectral dimensions. While our analysis focuses on optical and SAR data in this work, Earth observation systems can incorporate additional modalities such as thermal infrared and atmospheric measurements.
  • Figure 2: Illustration of Research Scope. In this work, we focus on static geospatial raster data in optical and SAR modalities and refer to them as geospatial data. However, geospatial data in general can also exist as vector data, incorporate temporal dimensions, and span various other modalities.
  • Figure 3: Hyperspectral Patch Embedding. Hyperspectral images, with dimensions $C\times H \times W$, are embedded into spatial-spectral tokens through the Tied Patch Embedding Layer. We then prepend the Spatial, Spectral and global [CLS] tokens to the resulting patch tokens. Among these, the Spatial [CLS] tokens (yellow box) represent every spatial patch across the spectrum, the Spectral [CLS] tokens (green box) represent the information of every channel, and the global [CLS] token (red box) is the representation of all the spatial-spectral tokens.
  • Figure 4: Continuous Positional-Channel Embedding. The visualization demonstrates the embedding patterns across spatial positions and spectral bands ($\lambda$). For each wavelength, the embedding exhibits continuous spatial variation, while the overall pattern evolves systematically with increasing wavelength. The embedding encodes both spatial positions and spectral information in a unified representation.
  • Figure 5: LESS Attention Block. An illustration of the LESS Attention Block, which decomposes spatial and spectral attention computations and approximates the full spatial-spectral attention through a Kronecker product of the individual attention maps. Multiple LESS Attention Blocks are connected in series, with each block transforming its input $x^{l}$ into the output $x^{l+1}$ that is passed to the subsequent layer.
  • ...and 5 more figures
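
The Kronecker-product approximation named in the Figure 5 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the attention maps here are random row-stochastic matrices rather than learned ones, and the token ordering (spatial index major, channel index minor) and all names are assumptions. The sketch only demonstrates the identity that makes the factorization cheap: multiplying the values by the full map `kron(A_spatial, A_spectral)` gives the same result as applying the spatial map along the patch axis and the spectral map along the channel axis, without ever building the large matrix.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along `axis`."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def kron_attention_demo(n_spatial=6, n_channels=4, dim=8, seed=0):
    """Compare explicit Kronecker attention with its factored form.

    All sizes are hypothetical. Returns the full attention map and the
    outputs computed the explicit way and the factored way.
    """
    rng = np.random.default_rng(seed)
    # Stand-ins for learned attention maps: each row sums to one.
    a_spatial = softmax(rng.normal(size=(n_spatial, n_spatial)))
    a_spectral = softmax(rng.normal(size=(n_channels, n_channels)))
    values = rng.normal(size=(n_spatial, n_channels, dim))

    # Explicit route: build the (N*C, N*C) map, then multiply.
    full = np.kron(a_spatial, a_spectral)
    flat = values.reshape(n_spatial * n_channels, dim)
    out_full = (full @ flat).reshape(n_spatial, n_channels, dim)

    # Factored route: contract each axis with its own small map.
    out_fact = np.einsum("nm,ck,mkd->ncd", a_spatial, a_spectral, values)
    return full, out_full, out_fact
```

With N spatial patches and C channels, the explicit map has (N·C)² entries, while the two factors together have only N² + C². The Kronecker product of two row-stochastic matrices is itself row-stochastic, so the combined map still behaves like an attention distribution over all spatial-spectral tokens.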