FactoFormer: Factorized Hyperspectral Transformers with Self-Supervised Pretraining

Shaheer Mohamed, Maryam Haghighat, Tharindu Fernando, Sridha Sridharan, Clinton Fookes, Peyman Moghadam

TL;DR

FactoFormer introduces a factorized spectral-spatial transformer architecture for hyperspectral image classification, paired with self-supervised pretraining via spectrally and spatially consistent masking. By modeling spectral and spatial interactions independently and then fusing their latent representations, it reduces the quadratic attention complexity to $O(m^2) + O(n^2)$ and achieves state-of-the-art performance across six public datasets, with faster convergence than prior transformer methods. The approach demonstrates the value of explicit spectral-spatial factorization and lightweight decoding in self-supervised pretraining, addressing data scarcity while preserving fine-grained information. This work has practical implications for scalable, data-efficient hyperspectral learning and opens avenues for cross-domain adaptation and broader HSI applications.
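
To give a feel for the size of that saving, the short sketch below counts pairwise attention interactions. It rests on assumptions not stated in the summary: `m` is taken to be the number of spectral tokens, `n` the number of spatial tokens, and the baseline is a hypothetical joint model attending over all `m * n` spectral-spatial tokens. The function names and example sizes are illustrative only.

```python
def attention_pairs(num_tokens: int) -> int:
    """Number of query-key pairs in one full self-attention layer."""
    return num_tokens * num_tokens


def joint_cost(m: int, n: int) -> int:
    """Hypothetical baseline (an assumption): attend over all m * n tokens at once."""
    return attention_pairs(m * n)


def factorized_cost(m: int, n: int) -> int:
    """Two independent transformers: one over m tokens, one over n tokens."""
    return attention_pairs(m) + attention_pairs(n)


if __name__ == "__main__":
    # Illustrative sizes only: 200 spectral tokens and a 7 x 7 spatial window.
    m, n = 200, 7 * 7
    print("joint:     ", joint_cost(m, n))
    print("factorized:", factorized_cost(m, n))
    print("ratio:     ", joint_cost(m, n) / factorized_cost(m, n))
```

Under these assumed sizes the factorized count is dominated by the larger of the two token sets, so the cost grows with $m^2 + n^2$ rather than with $(mn)^2$.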

Abstract

Hyperspectral images (HSIs) contain rich spectral and spatial information. Motivated by the success of transformers in natural language processing and computer vision, where they have shown the ability to learn long-range dependencies within input data, recent research has focused on using transformers for HSIs. However, current state-of-the-art hyperspectral transformers only tokenize the input HSI sample along the spectral dimension, resulting in the under-utilization of spatial information. Moreover, transformers are known to be data-hungry and their performance relies heavily on large-scale pretraining, which is challenging due to limited annotated hyperspectral data. Therefore, the potential of HSI transformers has not been fully realized. To overcome these limitations, we propose a novel factorized spectral-spatial transformer that incorporates factorized self-supervised pretraining procedures, leading to significant improvements in performance. The factorization of the inputs allows the spectral and spatial transformers to better capture the interactions within the hyperspectral data cubes. Inspired by masked image modeling pretraining, we also devise efficient masking strategies for pretraining each of the spectral and spatial transformers. We conduct experiments on six publicly available datasets for the HSI classification task and demonstrate that our model achieves state-of-the-art performance on all of them. The code for our model will be made available at https://github.com/csiro-robotics/factoformer.

Paper Structure (20 sections, 9 equations, 8 figures, 15 tables)

Figures (8)

  • Figure 1: An overview of the proposed FactoFormer architecture, a factorized transformer for hyperspectral images. FactoFormer splits hyperspectral image cubes into non-overlapping tokenized patches along the spectral and spatial dimensions and processes them with two transformers in parallel, one attending along the spectral dimension and the other along the spatial dimension. The two outputs are concatenated and passed to a multi-layer perceptron for classification.
  • Figure 2: An overview of the proposed pretraining network for spectral transformer in FactoFormer.
  • Figure 3: An overview of the proposed pretraining network for spatial transformer in FactoFormer.
  • Figure 4: Ground truth and the classification maps obtained by different models on the Indian Pines dataset.
  • Figure 5: Ground truth and the classification maps obtained by different models on the University of Pavia dataset.
  • ...and 3 more figures
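
The factorized tokenization and fusion described in the Figure 1 caption can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the cube layout `(H, W, B)`, the helper names `spectral_tokens`, `spatial_tokens` and `fuse`, and the parameters `bands_per_token` and `patch` are all hypothetical. The two transformers and the MLP classifier are omitted; only the non-overlapping splitting and the concatenation step are shown.

```python
import numpy as np


def spectral_tokens(cube: np.ndarray, bands_per_token: int = 1) -> np.ndarray:
    """Split an (H, W, B) cube into non-overlapping groups of adjacent bands.

    Returns an array of shape (m, bands_per_token * H * W), one row per token.
    """
    h, w, b = cube.shape
    if b % bands_per_token:
        raise ValueError("B must be divisible by bands_per_token")
    m = b // bands_per_token
    # Put bands first so each token holds whole band images.
    return cube.transpose(2, 0, 1).reshape(m, bands_per_token * h * w)


def spatial_tokens(cube: np.ndarray, patch: int = 1) -> np.ndarray:
    """Split an (H, W, B) cube into non-overlapping patch x patch spatial windows.

    Returns an array of shape (n, patch * patch * B), one row per token.
    """
    h, w, b = cube.shape
    if h % patch or w % patch:
        raise ValueError("H and W must be divisible by patch")
    gh, gw = h // patch, w // patch
    # (gh, patch, gw, patch, B) -> (gh, gw, patch, patch, B)
    windows = cube.reshape(gh, patch, gw, patch, b).transpose(0, 2, 1, 3, 4)
    return windows.reshape(gh * gw, patch * patch * b)


def fuse(spectral_repr: np.ndarray, spatial_repr: np.ndarray) -> np.ndarray:
    """Concatenate the two latent representations before classification."""
    return np.concatenate([spectral_repr, spatial_repr], axis=-1)


if __name__ == "__main__":
    cube = np.random.rand(8, 8, 20)            # assumed toy input: 8 x 8 pixels, 20 bands
    spec = spectral_tokens(cube)               # one token per band
    spat = spatial_tokens(cube, patch=2)       # one token per 2 x 2 window
    print(spec.shape, spat.shape)
    # Stand-ins for the transformer outputs: here simply the mean token of each branch.
    print(fuse(spec.mean(axis=0), spat.mean(axis=0)).shape)
```

In a full model, each token array would be linearly embedded and fed to its own transformer, and `fuse` would receive the two resulting latent vectors rather than the mean tokens used here as placeholders.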