CLAReSNet: When Convolution Meets Latent Attention for Hyperspectral Image Classification

Asmit Bandyopadhyay, Anindita Das Bhattacharjee, Rakesh Das

TL;DR

This work addresses two obstacles in hyperspectral image (HSI) classification, high spectral dimensionality and class imbalance, with a CNN-Transformer hybrid called CLAReSNet. It introduces Multi-Scale Spectral Latent Attention (MSLA), an adaptive latent-bottleneck mechanism that reduces attention complexity from $\mathcal{O}(T^2D)$ to $\mathcal{O}(T\log(T)D)$ while preserving spectral-discriminative power. A CNN-based spatial stem, hybrid spectral positional encoding, and stacked spectral encoders with hierarchical cross-attention fusion yield highly separable embeddings. On Indian Pines and Salinas, CLAReSNet reports overall accuracies of 99.71% and 99.96%, respectively, indicating that it generalizes across datasets with different resolutions and spectral characteristics and may scale to broader HSI analysis in remote sensing.

Abstract

Hyperspectral image (HSI) classification faces critical challenges, including high spectral dimensionality, complex spectral-spatial correlations, and limited training samples with severe class imbalance. While CNNs excel at local feature extraction and transformers capture long-range dependencies, their isolated application yields suboptimal results due to quadratic complexity and insufficient inductive biases. We propose CLAReSNet (Convolutional Latent Attention Residual Spectral Network), a hybrid architecture that integrates multi-scale convolutional extraction with transformer-style attention via an adaptive latent bottleneck. The model employs a multi-scale convolutional stem with deep residual blocks and an enhanced Convolutional Block Attention Module for hierarchical spatial features, followed by spectral encoder layers combining bidirectional RNNs (LSTM/GRU) with Multi-Scale Spectral Latent Attention (MSLA). MSLA reduces complexity from $\mathcal{O}(T^2D)$ to $\mathcal{O}(T\log(T)D)$ through adaptive latent token allocation (8-64 tokens) that scales logarithmically with the sequence length $T$. Hierarchical cross-attention fusion dynamically aggregates multi-level representations for robust classification. Experiments conducted on the Indian Pines and Salinas datasets show state-of-the-art performance, achieving overall accuracies of 99.71% and 99.96%, respectively, and significantly surpassing HybridSN, SSRN, and SpectralFormer. The learned embeddings exhibit superior inter-class separability and compact intra-class clustering, validating CLAReSNet's effectiveness under severe class imbalance.
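To make the complexity claim concrete, the following is a minimal sketch of how a logarithmic latent-token budget changes the attention cost. The abstract states only the 8-64 token range and the logarithmic scaling; the function names, the `scale` factor, the base-2 logarithm, and the two-pass cost model (tokens to latents, then latents back to tokens) are assumptions made for illustration, not the authors' implementation.

```python
import math


def latent_token_count(seq_len, min_tokens=8, max_tokens=64, scale=8):
    """Hypothetical allocation rule: the number of latent tokens grows with
    log2(seq_len) and is clamped to [min_tokens, max_tokens].

    The 8-64 range comes from the abstract; `scale` and the log base are
    assumptions chosen only to make the sketch concrete.
    """
    if seq_len < 1:
        raise ValueError("seq_len must be a positive integer")
    raw = int(scale * math.log2(max(seq_len, 2)))
    return max(min_tokens, min(max_tokens, raw))


def attention_cost(seq_len, dim):
    """Return (full_cost, latent_cost) in multiply-accumulate units,
    ignoring constant factors and projection layers.

    full_cost   ~ T * T * D      (every token attends to every token)
    latent_cost ~ 2 * T * M * D  (tokens -> M latents, then latents -> tokens)
    """
    m = latent_token_count(seq_len)
    full_cost = seq_len * seq_len * dim
    latent_cost = 2 * seq_len * m * dim
    return full_cost, latent_cost


if __name__ == "__main__":
    # Illustrative sequence lengths and embedding width (assumed values).
    for t in (64, 200, 1024, 4096):
        full, latent = attention_cost(t, dim=64)
        print(f"T={t:5d}  M={latent_token_count(t):2d}  "
              f"full={full:>12,d}  latent={latent:>12,d}")
```

Under these assumptions the saving is modest for short sequences and grows with $T$: once the budget saturates at 64 latent tokens, the bottleneck cost grows only linearly in the sequence length, while full self-attention keeps growing quadratically.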

Paper Structure

This paper contains 17 sections, 26 equations, 3 figures, and 4 tables.

Figures (3)

  • Figure 1: (a) Indian Pines (RGB Image, Ground Truth, Classification Map, Uncertainty Map) and (b) Salinas (RGB Image, Ground Truth, Classification Map, Uncertainty Map).
  • Figure 2: Overview of the proposed CLAReSNet model for hyperspectral image classification.
  • Figure 3: (a) Salinas Loss Curve, (b) Salinas Accuracy Curve, (c) Salinas Precision-Recall Curve, (d) Salinas t-SNE Plot, (e) Indian Pines Loss Curve, (f) Indian Pines Accuracy Curve, (g) Indian Pines Precision-Recall Curve, (h) Indian Pines t-SNE Plot.