
HyperSIGMA: Hyperspectral Intelligence Comprehension Foundation Model

Di Wang, Meiqi Hu, Yao Jin, Yuchun Miao, Jiaqi Yang, Yichu Xu, Xiaolei Qin, Jiaqi Ma, Lingyu Sun, Chenxing Li, Chuan Fu, Hongruixuan Chen, Chengxi Han, Naoto Yokoya, Jing Zhang, Minqiang Xu, Lin Liu, Lefei Zhang, Chen Wu, Bo Du, Dacheng Tao, Liangpei Zhang

TL;DR

HyperSIGMA targets universal hyperspectral image (HSI) interpretation across tasks and scenes with a vision-transformer foundation model that scales to over one billion parameters and is pre-trained in a self-supervised manner on a large-scale hyperspectral dataset, HyperGlobal-450K. It introduces Sparse Sampling Attention (SSA) to learn diverse contextual features despite spectral and spatial redundancy, and a Spectral Enhancement Module to fuse spatial and spectral features. The model comprises two ViT subnetworks, SpatViT and SpecViT, whose weights are pre-trained with Masked Image Modeling. Experiments on 20 datasets spanning high-level and low-level HSI tasks show superior performance, along with scalability, robustness to limited data and noise, and cross-modal transferability. The result is a unified, scalable, and efficient framework for hyperspectral interpretation in real-world earth observation applications.

Abstract

Accurate hyperspectral image (HSI) interpretation is critical for providing valuable insights into various earth observation-related applications such as urban planning, precision agriculture, and environmental monitoring. However, existing HSI processing methods are predominantly task-specific and scene-dependent, which severely limits their ability to transfer knowledge across tasks and scenes, thereby reducing the practicality in real-world applications. To address these challenges, we present HyperSIGMA, a vision transformer-based foundation model that unifies HSI interpretation across tasks and scenes, scalable to over one billion parameters. To overcome the spectral and spatial redundancy inherent in HSIs, we introduce a novel sparse sampling attention (SSA) mechanism, which effectively promotes the learning of diverse contextual features and serves as the basic block of HyperSIGMA. HyperSIGMA integrates spatial and spectral features using a specially designed spectral enhancement module. In addition, we construct a large-scale hyperspectral dataset, HyperGlobal-450K, for pre-training, which contains about 450K hyperspectral images, significantly surpassing existing datasets in scale. Extensive experiments on various high-level and low-level HSI tasks demonstrate HyperSIGMA's versatility and superior representational capability compared to current state-of-the-art methods. Moreover, HyperSIGMA shows significant advantages in scalability, robustness, cross-modal transferring capability, real-world applicability, and computational efficiency. The code and models will be released at https://github.com/WHU-Sigma/HyperSIGMA.
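
The TL;DR states that the SpatViT and SpecViT weights are pre-trained with Masked Image Modeling. As a rough illustration of that general idea only, the sketch below splits a synthetic HSI cube into spatial patches, hides a random subset, and scores a reconstruction on the hidden patches alone. The patch size, the 75% mask ratio, the random input, and the helper names `random_patch_mask` and `masked_mse` are all assumptions made for this example; they are not taken from the paper or from the HyperSIGMA code.

```python
import numpy as np


def random_patch_mask(cube, patch=8, mask_ratio=0.75, seed=0):
    """Split an (H, W, C) cube into non-overlapping patches and hide a random subset.

    Hypothetical helper: patch size and mask ratio are illustrative assumptions.
    Returns the visible tokens, a boolean mask (True = hidden), and all tokens.
    """
    h, w, c = cube.shape
    assert h % patch == 0 and w % patch == 0, "H and W must be divisible by patch"
    gh, gw = h // patch, w // patch

    # (H, W, C) -> (gh, patch, gw, patch, C) -> (gh * gw, patch * patch * C)
    tokens = (
        cube.reshape(gh, patch, gw, patch, c)
        .transpose(0, 2, 1, 3, 4)
        .reshape(gh * gw, -1)
    )

    # Choose which patches stay visible; everything else is masked.
    rng = np.random.default_rng(seed)
    n_keep = int(round(gh * gw * (1.0 - mask_ratio)))
    keep_idx = np.sort(rng.permutation(gh * gw)[:n_keep])

    mask = np.ones(gh * gw, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], mask, tokens


def masked_mse(pred, target, mask):
    """Mean squared error computed only over the hidden (masked) patches."""
    return float(((pred - target) ** 2)[mask].mean())


# Synthetic stand-in for an HSI patch: 64 x 64 pixels, 32 spectral bands.
cube = np.random.default_rng(1).random((64, 64, 32)).astype(np.float32)
visible, mask, tokens = random_patch_mask(cube)

# An encoder would see only `visible`; a decoder would predict all tokens,
# and the loss would compare its prediction with `tokens` where `mask` is True.
baseline_loss = masked_mse(np.zeros_like(tokens), tokens, mask)
```

In an actual pre-training loop the all-zeros prediction would be replaced by the output of an encoder-decoder network; it is used here only so that the loss can be evaluated without a model.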

Paper Structure (87 sections, 11 equations, 37 figures, 31 tables)


Figures (37)

  • Figure 1: HyperSIGMA offers a universal solution for HSI processing, demonstrating superior performance across 20 datasets, including both high-level and low-level hyperspectral tasks, as well as multispectral scenes. It outperforms advanced models like SpectralGPT, even those specifically designed for these tasks. HIC: Hyperspectral Image Classification. HTD: Hyperspectral Target Detection. HAD: Hyperspectral Anomaly Detection. HCD: Hyperspectral Change Detection. HIU: Hyperspectral Image Unmixing. HID: Hyperspectral Image Denoising. HSR: Hyperspectral Super-Resolution. MCD: Multispectral Change Detection.
  • Figure 2: Previous HSI models are trained separately on different scenes, limiting cross-scene knowledge transfer. In contrast, our model acquires universal, scene-agnostic knowledge through pre-training with a large dataset of global HSIs, enabling effective transfer to various scenes through fine-tuning.
  • Figure 3: Comparison of RGB, synthetic-aperture radar (SAR), multispectral, and hyperspectral images.
  • Figure 4: The distribution of HyperGlobal-450K samples across the globe. The sampled patches of typical landscapes from different regions, including forests, grasslands, barelands, and croplands, clearly exhibit the characteristics of their respective geographical regions.
  • Figure 5: Comparison of different attention mechanisms: (a) Full SA, (b) WMHSA, (c) RVSA, (d) NLSA, (e) DMHA, (f) SSA. Stars represent queries, with dots surrounded by corresponding colored lines indicating the attention regions of captured contexts. Green rectangles in (c) and (f) denote common areas shared by both queries. In DMHA, all queries share the same keys in the yellow region.
  • ...and 32 more figures