A General Purpose Spectral Foundational Model for Both Proximal and Remote Sensing Spectral Imaging
William Michael Laprade, Jesper Cairo Westergaard, Svend Christensen, Mads Nielsen, Anders Bjorholm Dahl
TL;DR
The paper tackles the data scarcity of proximal hyperspectral imaging and the limitations of existing spectral foundation models by introducing a large-scale spectral foundational model based on the masked autoencoder. It combines spectral channel encoding, a two-stage spatial-spectral masking strategy, and ImageNet pretraining to create a robust ViT-based encoder that can adapt to varying channel numbers. Pretraining on 16,518 hyperspectral images from proximal and remote sources enables effective finetuning on downstream tasks, with strong performance on proximal datasets such as DriedFoods and Wheat and on remote sensing datasets such as BigEarthNet. The approach is robust to variations in channel count and masking configuration, which suggests it can serve as a practical starting point for spectral imaging analysis across diverse applications.
Abstract
Spectral imaging data acquired via multispectral and hyperspectral cameras can have hundreds of channels, where each channel records the reflectance at a specific wavelength and bandwidth. Time and resource constraints limit our ability to collect large spectral datasets, making it difficult to build and train predictive models from scratch. In the RGB domain, we can often alleviate some of the limitations of smaller datasets by using pretrained foundational models as a starting point. However, most existing foundation models are pretrained on large datasets of 3-channel RGB images, which severely limits their effectiveness when used with spectral imaging data. The few spectral foundation models that do exist usually have one of two limitations: (1) they are built and trained only on remote sensing data, limiting their application in proximal spectral imaging, or (2) they rely on the more widely available multispectral imaging datasets with fewer than 15 channels, restricting their use with hyperspectral images that have hundreds of channels. To alleviate these issues, we propose a large-scale foundational model and dataset built upon the masked autoencoder architecture that takes advantage of spectral channel encoding, spatial-spectral masking, and ImageNet pretraining to yield an adaptable and robust model for downstream spectral imaging tasks.
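The abstract names spectral channel encoding and spatial-spectral masking without giving their details, so the following is only a minimal sketch of how such components might look, not the authors' implementation. It assumes a sinusoidal encoding of each channel's center wavelength (so the encoder is not tied to a fixed channel count) and a two-stage mask that first hides whole spatial patches and then hides a random subset of channels in each remaining patch. The function names `wavelength_encoding` and `two_stage_mask`, their parameters, and the default ratios are all hypothetical.

```python
import numpy as np


def wavelength_encoding(wavelengths_nm, dim=8, max_period=10000.0):
    """Sinusoidal encoding of channel center wavelengths (assumed scheme).

    Returns an array of shape (num_channels, dim); dim must be even.
    """
    wavelengths = np.asarray(wavelengths_nm, dtype=np.float64)[:, None]
    half = dim // 2
    # Geometrically spaced frequencies, as in standard positional encodings.
    freqs = max_period ** (-np.arange(half) / half)
    angles = wavelengths * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1)


def two_stage_mask(num_patches, num_channels,
                   spatial_ratio=0.5, spectral_ratio=0.5, seed=0):
    """Boolean mask of shape (num_channels, num_patches); True means masked.

    Both stages are illustrative assumptions, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    mask = np.zeros((num_channels, num_patches), dtype=bool)

    # Stage 1 (assumed): hide whole spatial patches across every channel.
    n_spatial = int(round(spatial_ratio * num_patches))
    hidden_patches = rng.choice(num_patches, size=n_spatial, replace=False)
    mask[:, hidden_patches] = True

    # Stage 2 (assumed): in each still-visible patch, hide some channels.
    n_spectral = int(round(spectral_ratio * num_channels))
    visible_patches = np.setdiff1d(np.arange(num_patches), hidden_patches)
    for p in visible_patches:
        hidden_channels = rng.choice(num_channels, size=n_spectral, replace=False)
        mask[hidden_channels, p] = True
    return mask


if __name__ == "__main__":
    # Three hypothetical channels centered at 450, 550, and 650 nm.
    enc = wavelength_encoding([450.0, 550.0, 650.0], dim=8)
    mask = two_stage_mask(num_patches=16, num_channels=10)
    print(enc.shape, mask.shape, mask.mean())
```

Because the encoding depends only on each channel's wavelength, the same function applies to a 10-channel multispectral image or a hyperspectral image with hundreds of channels, which is the kind of channel-count adaptability the abstract describes.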
