A General Purpose Spectral Foundational Model for Both Proximal and Remote Sensing Spectral Imaging

William Michael Laprade, Jesper Cairo Westergaard, Svend Christensen, Mads Nielsen, Anders Bjorholm Dahl

TL;DR

The paper tackles the data scarcity of proximal hyperspectral imaging and the limitations of existing spectral foundation models by introducing a large-scale masked autoencoder-based spectral foundational model. It combines spectral channel encoding, a two-stage spatial-spectral masking strategy, and ImageNet pretraining to create a robust ViT-based encoder that can adapt to varying channel numbers. Pretraining on 16,518 hyperspectral images from proximal and remote sources enables effective finetuning on downstream tasks, with strong performance on proximal datasets like DriedFoods and Wheat and on remote sensing datasets such as BigEarthNet. The approach demonstrates robustness to channel-count variations and masking configurations, suggesting practical impact for accelerating spectral imaging analysis across diverse applications.

Abstract

Spectral imaging data acquired via multispectral and hyperspectral cameras can have hundreds of channels, where each channel records the reflectance at a specific wavelength and bandwidth. Time and resource constraints limit our ability to collect large spectral datasets, making it difficult to build and train predictive models from scratch. In the RGB domain, we can often alleviate some of the limitations of smaller datasets by using pretrained foundational models as a starting point. However, most existing foundation models are pretrained on large datasets of 3-channel RGB images, severely limiting their effectiveness when used with spectral imaging data. The few spectral foundation models that do exist usually have one of two limitations: (1) they are built and trained only on remote sensing data, limiting their application in proximal spectral imaging, or (2) they utilize the more widely available multispectral imaging datasets with fewer than 15 channels, restricting their use with hundred-channel hyperspectral images. To alleviate these issues, we propose a large-scale foundational model and dataset built upon the masked autoencoder architecture that takes advantage of spectral channel encoding, spatial-spectral masking, and ImageNet pretraining to provide an adaptable and robust model for downstream spectral imaging tasks.

Paper Structure

This paper contains 20 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: A masked autoencoder is used to pretrain a ViT encoder using hyperspectral data. The encoder can then be used for finetuning on a wide range of hyperspectral tasks across both proximal and remote sensing imagery.
  • Figure 2: Sample images from existing datasets in RGB.
  • Figure 3: Sample images from collected datasets in RGB.
  • Figure 4: Our MAE architecture. The hyperspectral image is divided into single-channel $16 \times 16 \times 1$ patches. Masking removes a majority of the patches, and the remaining patches are fed into the encoder. Using the encoded information from the visible patches, the decoder attempts to reconstruct the masked patches. MSE loss is computed only on the reconstructed masked patches.
  • Figure 5: Two-stage masking. An initial column masking to remove entire spatial regions from the image followed by a channel masking to remove spectral information.
  • ...and 1 more figure
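
The captions for Figures 4 and 5 describe single-channel patching, two-stage masking, and a loss restricted to masked patches. The following is a minimal NumPy sketch of those three steps, not the authors' implementation. The function names, the masking ratios, and the choice to draw a fresh channel subset per visible position are all assumptions made for illustration, and "column masking" is read here as dropping a spatial patch position across every channel.

```python
import numpy as np


def patchify_single_channel(image, patch=16):
    """Split an (H, W, C) cube into single-channel patches of shape (C, H/p, W/p, p, p)."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be multiples of the patch size"
    gh, gw = h // patch, w // patch
    # (gh, p, gw, p, C) -> (C, gh, gw, p, p): each patch covers exactly one channel.
    return image.reshape(gh, patch, gw, patch, c).transpose(4, 0, 2, 1, 3)


def two_stage_mask(num_channels, grid_h, grid_w, spatial_ratio, channel_ratio, rng):
    """Return a boolean mask of shape (C, grid_h, grid_w); True means the patch is masked."""
    n_pos = grid_h * grid_w
    mask = np.zeros((num_channels, n_pos), dtype=bool)

    # Stage 1 (assumed reading of "column masking"): drop whole spatial
    # positions, i.e. every channel of the patch at that location.
    n_spatial = int(round(spatial_ratio * n_pos))
    dropped = rng.choice(n_pos, size=n_spatial, replace=False)
    mask[:, dropped] = True

    # Stage 2: at each position that is still visible, mask a random subset
    # of channels to remove spectral information.
    n_chan = int(round(channel_ratio * num_channels))
    for pos in np.setdiff1d(np.arange(n_pos), dropped):
        mask[rng.choice(num_channels, size=n_chan, replace=False), pos] = True

    return mask.reshape(num_channels, grid_h, grid_w)


def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=(-2, -1))  # shape (C, gh, gw)
    return per_patch[mask].mean()


# Toy example: a 32 x 32 image with 4 channels gives a 2 x 2 grid of patches per channel.
rng = np.random.default_rng(0)
image = rng.random((32, 32, 4))
patches = patchify_single_channel(image)
mask = two_stage_mask(4, 2, 2, spatial_ratio=0.5, channel_ratio=0.5, rng=rng)
visible = patches[~mask]  # only these patches would be passed to the encoder
loss = masked_mse(patches + 1.0, patches, mask)  # stand-in "prediction" offset by 1
```

In this sketch the encoder input is `patches[~mask]`, so its sequence length depends on both ratios and on the channel count, which is one way a ViT-style encoder can handle images with different numbers of channels.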