CLAReSNet: When Convolution Meets Latent Attention for Hyperspectral Image Classification
Asmit Bandyopadhyay, Anindita Das Bhattacharjee, Rakesh Das
TL;DR
CLAReSNet is a CNN-Transformer hybrid for hyperspectral image classification that targets high spectral dimensionality and class imbalance. It introduces Multi-Scale Spectral Latent Attention (MSLA), an adaptive latent-bottleneck mechanism that reduces attention complexity from $\mathcal{O}(T^2D)$ to $\mathcal{O}(T\log(T)D)$ while preserving spectral discriminative power. A CNN-based spatial stem, hybrid spectral positional encoding, and stacked spectral encoders with hierarchical cross-attention fusion yield highly separable embeddings and robust performance on challenging benchmarks. On Indian Pines and Salinas, CLAReSNet reaches state-of-the-art overall accuracies of 99.71% and 99.96%, respectively, indicating strong generalization across datasets with varying resolutions and spectral characteristics and potential for scalable HSI analysis in remote sensing tasks.
Abstract
Hyperspectral image (HSI) classification faces critical challenges, including high spectral dimensionality, complex spectral-spatial correlations, and limited training samples with severe class imbalance. While CNNs excel at local feature extraction and transformers capture long-range dependencies, their isolated application yields suboptimal results due to quadratic complexity and insufficient inductive biases. We propose CLAReSNet (Convolutional Latent Attention Residual Spectral Network), a hybrid architecture that integrates multi-scale convolutional extraction with transformer-style attention via an adaptive latent bottleneck. The model employs a multi-scale convolutional stem with deep residual blocks and an enhanced Convolutional Block Attention Module for hierarchical spatial features, followed by spectral encoder layers combining bidirectional RNNs (LSTM/GRU) with Multi-Scale Spectral Latent Attention (MSLA). MSLA reduces complexity from $\mathcal{O}(T^2D)$ to $\mathcal{O}(T\log(T)D)$ through adaptive latent token allocation (8 to 64 tokens) that scales logarithmically with the sequence length $T$. Hierarchical cross-attention fusion dynamically aggregates multi-level representations for robust classification. Experiments on the Indian Pines and Salinas datasets show state-of-the-art performance, with overall accuracies of 99.71% and 99.96%, respectively, significantly surpassing HybridSN, SSRN, and SpectralFormer. The learned embeddings exhibit superior inter-class separability and compact intra-class clustering, validating CLAReSNet's effectiveness under severe class imbalance.
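The complexity claim for MSLA can be illustrated with a minimal Python sketch. The abstract states only that the number of latent tokens lies between 8 and 64 and scales logarithmically with the sequence length; the specific rule used here (`scale * log2(T)`, rounded up and clamped), the function names `num_latent_tokens` and `attention_cost`, and the rough cost model are assumptions made for illustration, not the paper's implementation.

```python
import math


def num_latent_tokens(seq_len: int, scale: float = 8.0,
                      k_min: int = 8, k_max: int = 64) -> int:
    """Hypothetical allocation rule (an assumption, not the paper's formula):
    the latent count grows with log2(T) and is clamped to [k_min, k_max]."""
    if seq_len < 1:
        raise ValueError("seq_len must be positive")
    k = math.ceil(scale * math.log2(max(seq_len, 2)))
    return max(k_min, min(k_max, k))


def attention_cost(seq_len: int, dim: int, latents=None) -> int:
    """Rough multiply-accumulate count for the score and value products only
    (projections, softmax, and multi-head details are ignored)."""
    if latents is None:
        # Full self-attention: Q K^T and A V each cost about T * T * D.
        return 2 * seq_len * seq_len * dim
    # Latent bottleneck: latents attend to the T tokens, then the T tokens
    # attend back to the latents; each direction costs about 2 * T * k * D.
    return 4 * seq_len * latents * dim


if __name__ == "__main__":
    dim = 64  # illustrative embedding width, not taken from the paper
    for seq_len in (16, 256, 4096):
        k = num_latent_tokens(seq_len)
        full = attention_cost(seq_len, dim)
        bottleneck = attention_cost(seq_len, dim, k)
        print(f"T={seq_len:5d}  k={k:2d}  full={full}  latent={bottleneck}")
```

Under this cost model the bottleneck costs roughly $4TkD$ operations against $2T^2D$ for full self-attention, so with $k \propto \log T$ the total grows as $\mathcal{O}(T\log(T)D)$. The saving only appears once $T$ exceeds about $2k$, and because the token count is capped at 64, the cost becomes linear in $T$ for very long sequences.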
