DiffFormer: a Differential Spatial-Spectral Transformer for Hyperspectral Image Classification

Muhammad Ahmad, Manuel Mazzara, Salvatore Distefano, Adil Mehmood Khan, Silvia Liberata Ullo

TL;DR

DiffFormer is a transformer framework for hyperspectral image classification (HSIC) that combines a differential spatial-spectral attention mechanism with 3D patch-based tokenization and class-token aggregation. Its Differential Multi-Head Self-Attention (DMHSA) highlights local spectral-spatial variations, while SwiGLU activations and sinusoidal positional encoding support nonlinear feature learning and continuity across bands. Experiments on the HC, UH, SA, and PU datasets report state-of-the-art overall accuracy (OA) and kappa along with robust generalization, and analyses of patch size, training-set size, layer depth, and number of attention heads indicate that performance is stable across these settings. The approach balances accuracy and efficiency, and the authors plan to release the code to support adoption in real-time remote sensing applications.
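To make the two auxiliary components named above concrete, here is a minimal pure-Python sketch of a SwiGLU feed-forward block and of sinusoidal positional encoding. It follows the commonly used definitions (SwiGLU as a SiLU-gated linear unit, and the sine/cosine encoding from the original Transformer), not the authors' code, which is not part of this summary. The function names, weight shapes, and the base of 10000 are illustrative assumptions.

```python
import math


def silu(x):
    """SiLU (Swish) activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def linear(x, w):
    """Multiply a vector x (length d_in) by a matrix w (d_in rows, d_out columns)."""
    return [sum(xi * wij for xi, wij in zip(x, col)) for col in zip(*w)]


def swiglu_ffn(x, w_gate, w_up, w_down):
    """Feed-forward block with a SwiGLU gate (assumed, bias-free form).

    hidden = SiLU(x @ w_gate) * (x @ w_up), output = hidden @ w_down
    """
    gate = [silu(g) for g in linear(x, w_gate)]
    up = linear(x, w_up)
    hidden = [g * u for g, u in zip(gate, up)]
    return linear(hidden, w_down)


def sinusoidal_positional_encoding(num_tokens, dim, base=10000.0):
    """Return a num_tokens x dim table of sine/cosine position codes.

    Even columns hold sin(pos / base^(i/dim)); odd columns hold the matching cosine.
    """
    table = [[0.0] * dim for _ in range(num_tokens)]
    for pos in range(num_tokens):
        for i in range(0, dim, 2):
            angle = pos / (base ** (i / dim))
            table[pos][i] = math.sin(angle)
            if i + 1 < dim:
                table[pos][i + 1] = math.cos(angle)
    return table


# Usage with hypothetical toy weights: add position codes to a token, then apply the FFN.
token = [0.5, -1.0]
position_codes = sinusoidal_positional_encoding(num_tokens=3, dim=2)
encoded = [t + p for t, p in zip(token, position_codes[1])]
w_gate = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
w_up = [[0.3, 0.2, 0.1], [0.6, 0.5, 0.4]]
w_down = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
output = swiglu_ffn(encoded, w_gate, w_up, w_down)
```

The gate lets the network scale each hidden unit by a smooth, input-dependent factor, which is the usual motivation for preferring SwiGLU over a plain ReLU or GELU feed-forward layer.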

Abstract

Hyperspectral image classification (HSIC) has gained significant attention because of its potential in analyzing high-dimensional data with rich spectral and spatial information. In this work, we propose the Differential Spatial-Spectral Transformer (DiffFormer), a novel framework designed to address the inherent challenges of HSIC, such as spectral redundancy and spatial discontinuity. The DiffFormer leverages a Differential Multi-Head Self-Attention (DMHSA) mechanism, which enhances local feature discrimination by introducing differential attention to accentuate subtle variations across neighboring spectral-spatial patches. The architecture integrates Spectral-Spatial Tokenization through three-dimensional (3D) convolution-based patch embeddings, positional encoding, and a stack of transformer layers equipped with the SwiGLU activation function, a variant of the Gated Linear Unit (GLU), for efficient feature extraction. A token-based classification head further ensures robust representation learning, enabling precise labeling of hyperspectral pixels. Extensive experiments on benchmark hyperspectral datasets demonstrate the superiority of DiffFormer in terms of classification accuracy, computational efficiency, and generalizability, compared to existing state-of-the-art (SOTA) methods. In addition, this work provides a detailed analysis of computational complexity, showcasing the scalability of the model for large-scale remote sensing applications. The source code will be made available at https://github.com/mahmad000/DiffFormer after the first round of revision.
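The abstract does not spell out the DMHSA equations, so the following is only a sketch of one plausible reading: a single differential attention head in the style of the Differential Transformer, where the output is the difference of two softmax attention maps applied to the values. The formula, the fixed scalar `lam`, and all function names are assumptions for illustration and may differ from the paper's actual mechanism.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def attention_map(q, k):
    """Scaled dot-product attention weights: softmax(q k^T / sqrt(d)) per query row."""
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    return [softmax([s / scale for s in row]) for row in matmul(q, k_t)]


def differential_attention(q1, k1, q2, k2, v, lam=0.5):
    """Assumed form: (softmax(q1 k1^T) - lam * softmax(q2 k2^T)) v.

    Subtracting the second map cancels attention mass the two maps share,
    so token-specific differences stand out in the output.
    """
    a1 = attention_map(q1, k1)
    a2 = attention_map(q2, k2)
    diff = [[x - lam * y for x, y in zip(r1, r2)] for r1, r2 in zip(a1, a2)]
    return matmul(diff, v)


# Usage with hypothetical toy projections for three tokens of width two.
q1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
q2 = [[0.5, 0.5], [0.2, 0.1], [0.0, 1.0]]
k2 = [[0.3, 0.7], [1.0, 0.0], [0.0, 0.2]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = differential_attention(q1, k1, q2, k2, v, lam=0.5)
```

Two sanity checks follow from this assumed form: with `lam=0` the head reduces to ordinary scaled dot-product attention, and with identical query/key pairs and `lam=1` the two maps cancel and the output is all zeros.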

Paper Structure

This paper contains 14 sections, 9 equations, 10 figures, 6 tables.

Figures (10)

  • Figure 1: Schematic representation of the DiffFormer pipeline for HSIC. The pipeline starts with hyperspectral data preprocessing, where fused patches are generated and spatial-spectral features are extracted. Differential Attention is employed within the encoder to refine the attention mechanism by integrating positional embeddings (PE) for enhanced spectral-spatial relationships. The hierarchical encoding layers aggregate the learned features across $L-1$ layers, enabling multi-scale representation learning. The model's effectiveness is evaluated on HSIC tasks, demonstrating its ability to accurately delineate class boundaries in hyperspectral datasets.
  • Figure 2: Time comparison for different patch sizes across four datasets. The x-axis represents the patch size, while the y-axis denotes the time taken for processing in seconds.
  • Figure 3: Classification performance for different percentages of training samples. The results show the impact of training-set size with a $12 \times 12$ patch size.
  • Figure 4: Impact of transformer layer depth on OA for HSIC. The bar plot shows the OA achieved by the DiffFormer model across six transformer layers for four datasets using a patch size of $12 \times 12$.
  • Figure 5: Overall Accuracy and Time for Different Heads
  • ...and 5 more figures