Dual Selective Fusion Transformer Network for Hyperspectral Image Classification

Yichu Xu; Di Wang; Lefei Zhang; Liangpei Zhang

TL;DR

DSFormer addresses two weaknesses of Transformers in hyperspectral image (HSI) classification, fixed receptive fields and noisy self-attention, with two complementary blocks: the Kernel Selective Fusion Transformer Block (KSFTB), which learns adaptive multiscale spatial-spectral receptive fields, and the Token Selective Fusion Transformer Block (TSFTB), which restricts self-attention fusion to selected tokens. The model fuses features across scales and focuses on the most informative spatial-spectral tokens, reducing interference from irrelevant regions. Experiments on four benchmark datasets show that DSFormer achieves state-of-the-art accuracy, including an overall accuracy (OA) of $96.59\%$ on Pavia University, while keeping parameter count and runtime competitive. These results underscore the practical value of adaptive receptive field learning and token-level attention sparsification for robust HSI interpretation and land-cover mapping.

Abstract

Transformers have achieved satisfactory results in hyperspectral image (HSI) classification. However, existing Transformer models face two key challenges when dealing with HSI scenes characterized by diverse land cover types and rich spectral information: (1) a fixed receptive field overlooks the effective contextual scales required by various HSI objects; (2) invalid self-attention features in context fusion impair model performance. To address these limitations, we propose a novel Dual Selective Fusion Transformer Network (DSFormer) for HSI classification. DSFormer achieves joint spatial and spectral contextual modeling by flexibly selecting and fusing features across different receptive fields, and it reduces interference from unnecessary information by focusing on the most relevant spatial-spectral tokens. Specifically, we design a Kernel Selective Fusion Transformer Block (KSFTB) to learn an optimal receptive field by adaptively fusing spatial and spectral features across different scales, enhancing the model's ability to accurately identify diverse HSI objects. Additionally, we introduce a Token Selective Fusion Transformer Block (TSFTB), which selects and combines essential tokens during the spatial-spectral self-attention fusion process to capture the most crucial contexts. Extensive experiments conducted on four benchmark HSI datasets demonstrate that the proposed DSFormer significantly improves land cover classification accuracy, outperforming existing state-of-the-art methods. Specifically, DSFormer achieves overall accuracies of 96.59%, 97.66%, 95.17%, and 94.59% on the Pavia University, Houston, Indian Pines, and Whu-HongHu datasets, respectively, reflecting improvements of 3.19%, 1.14%, 0.91%, and 2.80% over the previous model. The code will be available online at https://github.com/YichuXu/DSFormer.
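The token selection idea described in the abstract can be illustrated with a generic top-k sparse attention sketch. This is not the paper's TSFTB: the function names `softmax` and `topk_attention`, the per-query top-`k` rule, and the plain Python list representation are assumptions made for illustration. The grouping and 3D convolutions that the paper uses to produce the queries, keys, and values are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def topk_attention(Q, K, V, k):
    """Each query attends only to its k highest-scoring keys.

    Q, K, V are lists of token vectors (lists of floats).
    Hypothetical helper: a generic top-k rule, not the authors' code.
    """
    d = len(Q[0])
    out = []
    for q in Q:
        # Scaled dot-product score between this query and every key.
        scores = [sum(qi * ki for qi, ki in zip(q, key)) / math.sqrt(d)
                  for key in K]
        # Keep the indices of the k largest scores; drop the rest.
        keep = sorted(range(len(K)), key=lambda j: scores[j], reverse=True)[:k]
        # Softmax is taken over the selected tokens only.
        weights = softmax([scores[j] for j in keep])
        out.append([sum(w * V[j][c] for w, j in zip(weights, keep))
                    for c in range(len(V[0]))])
    return out
```

Setting `k` to the number of tokens recovers ordinary softmax attention, so `k` acts as a knob between dense attention and aggressive token pruning.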

Paper Structure (16 sections, 16 equations, 13 figures, 7 tables, 1 algorithm)

Introduction
Related Works
CNN-based Methods
Transformer-based Methods
Methodology
Overview of DSFormer
Kernel Selective Fusion Transformer Block
Token Selective Fusion Transformer Block
Experiment
Data Descriptions
Experimental Settings
Parameter Analysis
Ablation Study
Performance Comparison and Analysis
Visualization
...and 1 more sections

Figures (13)

Figure 1: Challenges in current HSI classification methods. (a) Fixed and limited receptive fields can lead to misclassifications, such as confusing asphalt with bricks. (b) Irrelevant tokens may introduce noise that impairs classification accuracy.
Figure 2: An illustration of the proposed DSFormer. The Dual Selective Fusion Transformer Group (DSFTG) is composed of a Kernel Selective Fusion Transformer Block (KSFTB) and three consecutive Token Selective Fusion Transformer Blocks (TSFTBs). After two DSFTGs, a fully connected layer serves as the classification head and produces the classification results.
Figure 3: The proposed Kernel Selective Fusion Attention (KSFA). Here, DwConv denotes depth-wise convolution, Avg and Max denote channel-wise average pooling and max pooling, respectively, and GAP denotes spatial global average pooling.
Figure 4: An illustration of the proposed Token Selective Fusion Attention (TSFA). The input feature $P$ is first grouped and then fed through a 3D point-wise convolution, followed by a 3D depth-wise convolution, to generate the corresponding $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$. A token selection mechanism then determines which tokens take part in the self-attention operation.
Figure 5: Comparison of classification performance across four datasets under different values of $k$.
...and 8 more figures
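The Figure 3 caption names the ingredients of KSFA (depth-wise convolutions, channel-wise average and max pooling, and spatial global average pooling) but not how they are wired together. The sketch below is a loose 1D toy of the general idea of selecting between receptive fields, under several assumptions: box filters stand in for learned depth-wise convolutions, the kernel sizes 3 and 5 are arbitrary, the selection weights come from a softmax over the sum of the pooled descriptors instead of a learned layer, and the GAP step is left out. The function names are hypothetical.

```python
import math

def depthwise_box_conv1d(x, size):
    """Per-channel moving average with edge replication.

    Assumption: stands in for a learned depth-wise convolution; size must be odd.
    """
    half = size // 2
    out = []
    for channel in x:
        n = len(channel)
        row = []
        for i in range(n):
            window = [channel[min(max(i + o, 0), n - 1)]
                      for o in range(-half, half + 1)]
            row.append(sum(window) / size)
        out.append(row)
    return out

def kernel_selective_fusion(x, kernel_sizes=(3, 5)):
    """Fuse branches with different receptive fields using per-position weights.

    x is a list of channels, each a list of floats of equal length.
    """
    branches = [depthwise_box_conv1d(x, s) for s in kernel_sizes]
    n_pos = len(x[0])
    fused = [[0.0] * n_pos for _ in x]
    for i in range(n_pos):
        # Channel-wise average and max pooling: one descriptor per branch.
        logits = []
        for b in branches:
            column = [ch[i] for ch in b]
            logits.append(sum(column) / len(column) + max(column))
        # Softmax over branches gives selection weights that sum to 1.
        m = max(logits)
        exps = [math.exp(v - m) for v in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Weighted sum of the branch outputs at this position.
        for c in range(len(x)):
            fused[c][i] = sum(w * b[c][i] for w, b in zip(weights, branches))
    return fused
```

Because the weights at each position sum to 1, the output is a convex combination of the branch outputs, which is what lets the effective receptive field vary from position to position.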
