Dual Selective Fusion Transformer Network for Hyperspectral Image Classification

Yichu Xu, Di Wang, Lefei Zhang, Liangpei Zhang

TL;DR

DSFormer addresses two weaknesses of Transformers for hyperspectral image (HSI) classification, fixed receptive fields and noisy self-attention, with two complementary blocks: the Kernel Selective Fusion Transformer Block (KSFTB), which learns adaptive, multiscale spatial-spectral receptive fields, and the Token Selective Fusion Transformer Block (TSFTB), which restricts self-attention fusion to selected tokens. The model fuses features across scales and attends to the most informative spatial-spectral tokens, reducing interference from irrelevant regions. On four benchmark datasets, DSFormer reports state-of-the-art accuracy, including an overall accuracy (OA) of $96.59\%$ on Pavia University, with competitive parameter counts and runtime. These results point to the practical value of adaptive receptive field learning and token-level attention sparsification for HSI interpretation and land-cover mapping.

Abstract

Transformers have achieved satisfactory results in hyperspectral image (HSI) classification. However, existing Transformer models face two key challenges when dealing with HSI scenes characterized by diverse land cover types and rich spectral information: (1) a fixed receptive field overlooks the effective contextual scales required by different HSI objects; (2) invalid self-attention features in context fusion degrade model performance. To address these limitations, we propose a novel Dual Selective Fusion Transformer Network (DSFormer) for HSI classification. DSFormer achieves joint spatial and spectral contextual modeling by flexibly selecting and fusing features across different receptive fields, and reduces interference from unnecessary information by focusing on the most relevant spatial-spectral tokens. Specifically, we design a Kernel Selective Fusion Transformer Block (KSFTB) to learn an optimal receptive field by adaptively fusing spatial and spectral features across different scales, enhancing the model's ability to accurately identify diverse HSI objects. Additionally, we introduce a Token Selective Fusion Transformer Block (TSFTB), which selects and combines essential tokens during the spatial-spectral self-attention fusion process to capture the most crucial contexts. Extensive experiments conducted on four benchmark HSI datasets demonstrate that the proposed DSFormer significantly improves land cover classification accuracy, outperforming existing state-of-the-art methods. Specifically, DSFormer achieves overall accuracies of 96.59%, 97.66%, 95.17%, and 94.59% on the Pavia University, Houston, Indian Pines, and Whu-HongHu datasets, respectively, reflecting improvements of 3.19%, 1.14%, 0.91%, and 2.80% over the previous model. The code will be available online at https://github.com/YichuXu/DSFormer.

Paper Structure

This paper contains 16 sections, 16 equations, 13 figures, 7 tables, and 1 algorithm.

Figures (13)

  • Figure 1: Challenges in current HSI classification methods. (a) Fixed and limited receptive fields can lead to misclassifications, such as confusing asphalt with bricks. (b) Irrelevant tokens can introduce noise that impairs classification accuracy.
  • Figure 2: An illustration of the proposed DSFormer. The Dual Selective Fusion Transformer Group (DSFTG) is composed of a Kernel Selective Fusion Transformer Block (KSFTB) and three consecutive Token Selective Fusion Transformer Blocks (TSFTBs). After two DSFTGs, a fully connected layer serves as the classification head and produces the classification results.
  • Figure 3: The proposed Kernel Selective Fusion Attention (KSFA). DwConv denotes depth-wise convolution, Avg and Max denote channel-wise average pooling and max pooling, respectively, and GAP denotes spatial global average pooling.
  • Figure 4: An illustration of the proposed Token Selective Fusion Attention (TSFA). The input feature $P$ is first grouped and then fed through a 3D point-wise convolution, followed by a 3D depth-wise convolution, to generate the corresponding $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$. A token selection mechanism is then applied in the subsequent self-attention operation.
  • Figure 5: Comparison of classification performance across four datasets under different values of $k$.
  • ...and 8 more figures
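
The token selection described for TSFA (Figure 4) and the parameter $k$ studied in Figure 5 suggest a form of sparse attention in which each query keeps only a limited number of tokens. The excerpt above does not give the exact selection rule, so the following is only a minimal sketch under stated assumptions: it assumes a single attention head, skips the grouping and 3D convolutions that produce $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$, and assumes that selection means keeping the `top_k` highest-scoring keys per query. The function name `topk_selective_attention` and its parameters are hypothetical and do not come from the DSFormer code.

```python
import numpy as np


def topk_selective_attention(q, k, v, top_k):
    """Single-head attention where each query attends only to its
    top_k highest-scoring keys (an assumed selection rule).

    q, k, v: arrays of shape (num_tokens, dim).
    Returns the fused output and the sparse attention weights.
    """
    num_tokens, dim = q.shape
    top_k = min(top_k, k.shape[0])

    # Scaled dot-product scores, shape (num_queries, num_keys).
    scores = q @ k.T / np.sqrt(dim)

    # Column indices of the top_k largest scores in every row.
    keep = np.argpartition(scores, -top_k, axis=1)[:, -top_k:]

    # Additive mask: 0 for kept tokens, -inf for discarded tokens.
    mask = np.full_like(scores, -np.inf)
    np.put_along_axis(mask, keep, 0.0, axis=1)

    # Numerically stable softmax over the kept tokens only.
    masked = scores + mask
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights = weights / weights.sum(axis=1, keepdims=True)

    return weights @ v, weights


# Usage: 6 tokens with 4 features each, every query keeps 2 tokens.
rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 6, 4))
fused, attn = topk_selective_attention(q, k, v, top_k=2)
print(fused.shape)                # (6, 4)
print((attn > 0).sum(axis=1))     # number of tokens kept per query
```

With `top_k` equal to the number of tokens, the sketch reduces to ordinary softmax attention, which makes the effect of the selection step easy to isolate when varying `top_k`.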