3D-RCNet: Learning from Transformer to Build a 3D Relational ConvNet for Hyperspectral Image Classification

Haizhao Jing, Liuwei Wan, Xizhe Xue, Haokui Zhang, Ying Li

TL;DR

Hyperspectral image (HSI) classification faces a trade-off between local feature extraction and global context modeling: ViT offers strong global reasoning, but at high computational cost and with large training-data requirements. The paper presents 3D-RCNet, a four-stage network that inserts a 3D relational convolutional block (3D-RCBlock) into its deeper stages, blending ConvNet efficiency with Transformer-like attention inside a localized window. The 3D-RCBlock computes attention within a 3×3×3 window using the center voxel as the query, which produces dynamic kernels and requires fewer MACs than full self-attention ($2\times(H\times W\times S)^2\times C$) while preserving translation invariance. Empirically, 3D-RCNet outperforms state-of-the-art ConvNet- and ViT-based HSI methods on Indian Pines, Pavia University, and Houston 2013. Ablations support deploying the block in deeper stages, a default 3×3 kernel, and the block’s dynamic kernel behavior, which makes the approach practical for efficient, robust hyperspectral classification in data-constrained settings.
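
To make the windowed attention described above more concrete, here is a minimal NumPy sketch of attention restricted to a 3×3×3 neighborhood with the center voxel as the query. It is an illustration under stated assumptions, not the authors' implementation: the function name `local_relational_conv3d`, the projection matrices `w_q`, `w_k`, `w_v`, the scaling by $\sqrt{C}$, and the zero padding at the borders are all hypothetical choices made for this example.

```python
import numpy as np

def local_relational_conv3d(x, w_q, w_k, w_v, k=3):
    """Attention inside a k*k*k window, with the center voxel as the query.

    x: feature volume of shape (H, W, S, C).
    w_q, w_k, w_v: (C, C) projection matrices (hypothetical parameters).
    Returns the output volume (H, W, S, C) and the weights (H, W, S, k**3).
    """
    H, W, S, C = x.shape
    p = k // 2
    q = x @ w_q      # one query per voxel (the window center)
    key = x @ w_k
    val = x @ w_v

    # Zero-pad the two spatial axes and the spectral axis so that every voxel
    # has a full window. Assumption: padded positions still take part in the
    # softmax, which is a simplification at the borders.
    pad = ((p, p), (p, p), (p, p), (0, 0))
    key_p = np.pad(key, pad)
    val_p = np.pad(val, pad)

    # Gather the k**3 neighbors of every voxel: shape (H, W, S, k**3, C).
    offsets = [(i, j, l) for i in range(k) for j in range(k) for l in range(k)]
    key_n = np.stack([key_p[i:i + H, j:j + W, l:l + S] for i, j, l in offsets], axis=3)
    val_n = np.stack([val_p[i:i + H, j:j + W, l:l + S] for i, j, l in offsets], axis=3)

    # Scaled dot product between each center query and its neighbors' keys.
    logits = np.einsum("hwsc,hwsnc->hwsn", q, key_n) / np.sqrt(C)
    logits -= logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)

    # The weights act as an input-dependent ("dynamic") kernel over the window.
    out = np.einsum("hwsn,hwsnc->hwsc", weights, val_n)
    return out, weights
```

Because the same projections are applied at every position and each voxel only sees its own neighborhood, the operation slides over the volume like a convolution, while the kernel weights depend on the input rather than being fixed after training.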

Abstract

Recently, the Vision Transformer (ViT) model has replaced the classical Convolutional Neural Network (ConvNet) in various computer vision tasks due to its superior performance. Even in the hyperspectral image (HSI) classification field, ViT-based methods show promising potential. Nevertheless, ViT encounters notable difficulties in processing HSI data. Its self-attention mechanism has quadratic complexity, which escalates computational costs. Additionally, ViT's substantial demand for training samples does not align with the practical constraints posed by the expensive labeling of HSI data. To overcome these challenges, we propose a 3D relational ConvNet named 3D-RCNet, which inherits the strengths of both ConvNet and ViT, resulting in high performance in HSI classification. We embed the self-attention mechanism of the Transformer into the convolutional operation of the ConvNet to design a 3D relational convolutional operation and use it to build the final 3D-RCNet. The proposed 3D-RCNet maintains the high computational efficiency of ConvNet while enjoying the flexibility of ViT. Additionally, the proposed 3D relational convolutional operation is plug-and-play and can be inserted seamlessly into previous ConvNet-based HSI classification methods. Empirical evaluations on three representative benchmark HSI datasets show that the proposed model outperforms previous ConvNet-based and ViT-based HSI approaches.
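
The quadratic cost of self-attention can be illustrated with a short calculation. The full self-attention count below uses the expression quoted in the TL;DR, $2\times(H\times W\times S)^2\times C$. The windowed count, $2\times(H\times W\times S)\times k^3\times C$, is an assumption made for this sketch (each voxel attends to only $k^3$ neighbors) and is not a formula taken from the paper. The function names and the example sizes are likewise hypothetical.

```python
def full_attention_macs(h, w, s, c):
    """MACs of full self-attention over all h*w*s voxels: 2 * (h*w*s)^2 * c."""
    n = h * w * s
    return 2 * n ** 2 * c

def window_attention_macs(h, w, s, c, k=3):
    """Assumed MACs when each voxel attends to a k*k*k window: 2 * (h*w*s) * k^3 * c.

    This count is an illustrative assumption, not the paper's reported formula.
    """
    n = h * w * s
    return 2 * n * k ** 3 * c

if __name__ == "__main__":
    # Hypothetical sizes: a 9x9 spatial patch, 16 spectral positions, 32 channels.
    h, w, s, c = 9, 9, 16, 32
    full = full_attention_macs(h, w, s, c)
    local = window_attention_macs(h, w, s, c)
    print(f"full self-attention MACs: {full}")
    print(f"windowed attention MACs:  {local}")
    print(f"ratio (full / windowed):  {full / local}")
```

Under this assumption the ratio between the two counts is $(H\times W\times S)/k^3$, so the saving grows linearly with the number of voxels in the input volume.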

Paper Structure (17 sections, 5 equations, 9 figures, 6 tables)

Figures (9)

  • Figure 1: The proposed 3D-RCNet framework, which uses four stages of blocks to extract features at different depths from HSI data.
  • Figure 2: Comparison of the total MACs required by three methods given the same input. (a) 3D-ConvBlock. (b) Self-attention. (c) The proposed 3D-RCBlock.
  • Figure 3: False color composites of experimental HSI datasets and the ground truth of land cover type. (a) Indian Pines Dataset. (b) Pavia University Dataset. (c) Houston 2013 Dataset.
  • Figure 4: Visualization of prediction results from the comparative experiments on the Indian Pines dataset. (a) Ground truth. (b) 3D CNN. (c) LWNet. (d) SSFTT. (e) SpectralForm. (f) GraphGST. (g) The proposed 3D-RCNet.
  • Figure 5: Visualization of prediction results from the comparative experiments on the Pavia University dataset. (a) Ground truth. (b) 3D CNN. (c) LWNet. (d) SSFTT. (e) SpectralForm. (f) GraphGST. (g) The proposed 3D-RCNet.
  • ...and 4 more figures