DiffFormer: a Differential Spatial-Spectral Transformer for Hyperspectral Image Classification
Muhammad Ahmad, Manuel Mazzara, Salvatore Distefano, Adil Mehmood Khan, Silvia Liberata Ullo
TL;DR
DiffFormer is a transformer framework for hyperspectral image classification (HSIC) that combines a differential spatial-spectral attention mechanism with 3D patch-based tokenization and class-token aggregation. Its Differential Multi-Head Self-Attention (DMHSA) highlights local spectral-spatial variations, while SwiGLU activations and sinusoidal positional encoding improve nonlinear feature learning and continuity across bands. Experiments on the HC, UH, SA, and PU datasets report state-of-the-art overall accuracy (OA) and kappa along with robust generalization, and ablations over patch size, training-set size, depth, and number of attention heads indicate that the model is stable and scalable. The approach balances accuracy and efficiency, and the authors plan to release the code to support adoption in real-time remote sensing applications.
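To make the positional-encoding component concrete, the following is a minimal sketch of the standard sinusoidal scheme, in which even channels carry a sine and odd channels a cosine of the token position at geometrically spaced frequencies. The function name, the base of 10000, and the token and channel counts are illustrative assumptions; this summary does not state the exact variant or sizes used in DiffFormer.

```python
import math


def sinusoidal_positional_encoding(num_tokens, dim):
    """Return a num_tokens x dim table of sinusoidal position codes.

    Hypothetical helper for illustration; `dim` must be even so that
    every sine channel has a matching cosine channel.
    """
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    table = []
    for pos in range(num_tokens):
        row = []
        for i in range(0, dim, 2):
            # Frequency decreases geometrically with the channel index.
            angle = pos / (10000.0 ** (i / dim))
            row.append(math.sin(angle))  # even channel
            row.append(math.cos(angle))  # odd channel
        table.append(row)
    return table


# Example: codes for 4 patch tokens with 8 embedding channels (assumed sizes).
pe = sinusoidal_positional_encoding(4, 8)
```

In a transformer these codes are typically added element-wise to the patch embeddings before the first attention layer, so that neighboring tokens receive smoothly varying position signals.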
Abstract
Hyperspectral image classification (HSIC) has gained significant attention because of its potential for analyzing high-dimensional data with rich spectral and spatial information. In this work, we propose the Differential Spatial-Spectral Transformer (DiffFormer), a novel framework designed to address inherent challenges of HSIC such as spectral redundancy and spatial discontinuity. DiffFormer uses a Differential Multi-Head Self-Attention (DMHSA) mechanism, which improves local feature discrimination by introducing differential attention that accentuates subtle variations across neighboring spectral-spatial patches. The architecture integrates spectral-spatial tokenization through three-dimensional (3D) convolution-based patch embeddings, positional encoding, and a stack of transformer layers equipped with the SwiGLU activation function, a variant of the Gated Linear Unit (GLU), for efficient feature extraction. A token-based classification head supports robust representation learning and precise labeling of hyperspectral pixels. Extensive experiments on benchmark hyperspectral datasets show that DiffFormer outperforms existing state-of-the-art (SOTA) methods in classification accuracy, computational efficiency, and generalizability. In addition, this work provides a detailed analysis of computational complexity, demonstrating the scalability of the model for large-scale remote sensing applications. The source code will be made available at https://github.com/mahmad000/DiffFormer after the first round of revision.
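The abstract describes SwiGLU only as a GLU variant. As a minimal sketch of how such a gated feed-forward block is commonly written, the code below computes Swish(x W_gate) multiplied element-wise by (x W_value), then projects the result with W_out. The function and weight names, the omission of bias terms, and the tiny dimensions are assumptions for illustration, not details taken from the DiffFormer implementation.

```python
import math


def swish(z):
    """Swish (SiLU) activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(x, w):
    """Multiply a length-n vector by an n x m matrix given as a list of rows."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu_ffn(x, w_gate, w_value, w_out):
    """Gated feed-forward block (hypothetical helper, no biases).

    The gate branch passes through Swish and scales the value branch
    element-wise before the output projection.
    """
    gate = [swish(g) for g in matvec(x, w_gate)]
    value = matvec(x, w_value)
    hidden = [g * v for g, v in zip(gate, value)]
    return matvec(hidden, w_out)


# Example with identity weights so the gating effect is easy to inspect.
identity = [[1.0, 0.0], [0.0, 1.0]]
y = swiglu_ffn([1.0, 2.0], identity, identity, identity)
```

Compared with a plain two-layer feed-forward block, the extra gate branch lets the network modulate each hidden unit by a learned, input-dependent factor, which is the property the GLU family is generally valued for.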
