Investigation of Hierarchical Spectral Vision Transformer Architecture for Classification of Hyperspectral Imagery

Wei Liu; Saurabh Prasad; Melba Crawford

TL;DR

Evaluations of several mixer models built on the same unified architecture indicate that the strength of vision Transformers comes from their overall architecture rather than solely from the individual multi-head self-attention (MSA) components.

Abstract

In the past three years, there has been significant interest in hyperspectral imagery (HSI) classification using vision Transformers for analysis of remotely sensed data. Previous research has predominantly focused on the empirical integration of convolutional neural networks (CNNs) to augment the network's capability to extract local feature information. Yet the theoretical justification for vision Transformers outperforming CNN architectures in HSI classification remains an open question. To address this issue, a unified hierarchical spectral vision Transformer architecture, specifically tailored for HSI classification, is investigated. In this streamlined yet effective vision Transformer architecture, several mixer modules are each integrated separately. These include the CNN-mixer, which executes convolution operations; the spatial self-attention (SSA)-mixer and channel self-attention (CSA)-mixer, both of which are adaptations of classical self-attention blocks; and hybrid models such as the SSA+CNN-mixer and CSA+CNN-mixer, which merge convolution with self-attention operations. This integration yields a broad spectrum of vision Transformer-based models tailored for HSI classification. For the training process, a comprehensive analysis is performed that contrasts classical CNN models with their vision Transformer-based counterparts, with particular attention to disturbance robustness and the distribution of the largest eigenvalue of the Hessian. From the evaluations conducted on the various mixer models rooted in the unified architecture, it is concluded that the unique strength of vision Transformers can be attributed to their overarching architecture, rather than to the individual multi-head self-attention (MSA) components alone.
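The abstract refers to the largest eigenvalue of the Hessian as a way to characterize the training process. This page does not say how the authors compute it, so the following is only a minimal sketch of one common approach: power iteration driven by finite-difference Hessian-vector products. The toy quadratic loss and the names `grad`, `hvp`, and `largest_hessian_eigenvalue` are hypothetical and chosen purely for illustration.

```python
import numpy as np

def grad(w, A, b):
    # Gradient of the toy loss 0.5 * w^T A w - b^T w (assumed for illustration).
    return A @ w - b

def hvp(w, v, A, b, eps=1e-4):
    # Hessian-vector product approximated by a central difference of gradients.
    return (grad(w + eps * v, A, b) - grad(w - eps * v, A, b)) / (2 * eps)

def largest_hessian_eigenvalue(w, A, b, iters=200, seed=0):
    # Power iteration: repeatedly apply the Hessian and renormalize.
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(w.shape)
    v /= np.linalg.norm(v)
    for _ in range(iters):
        hv = hvp(w, v, A, b)
        v = hv / np.linalg.norm(hv)
    # Rayleigh quotient of the converged unit vector.
    return float(v @ hvp(w, v, A, b))

# Toy Hessian with known eigenvalues 4, 2, 1.
A = np.diag([4.0, 2.0, 1.0])
b = np.zeros(3)
lam = largest_hessian_eigenvalue(np.ones(3), A, b)
print(lam)
```

Power iteration converges to the eigenvalue of largest magnitude, so for a Hessian with large negative curvature the result may differ from the largest algebraic eigenvalue. For a real network, the gradient would come from automatic differentiation rather than a closed form.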

Paper Structure (13 sections, 9 equations, 15 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Proposed Method
Overall architecture construction
Mixer block options
Representation of the training process
Experimental Setup and Results
Dataset description and implementation detail
Comparison (baseline) methods
Model structure and complexity analysis
Experimental results
Conclusions
Acknowledgments

Figures (15)

Figure 1: Overall framework for HSI classification. The model consists of a unified architecture and mixer block options. The unified architecture is based on a novel hierarchical spectral vision Transformer, specifically tailored for HSI classification. Mixer block options include five common mixer blocks. Selecting each mixer in turn for the mixer blocks yields five distinct Transformer models. The visualization in the bottom right corner shows how the SSA-mixer and CSA-mixer can be converted into one another on sequence inputs using the transpose operator. Img2Seq: converts the image to a sequence. LN: layer normalization. MLP: multilayer perceptron. CNN: convolutional neural network. SSA: spatial self-attention. CSA: channel self-attention. FCL: fully connected layer.
Figure 2: Houston 2013 dataset. (a) False color image (band R: 60, G: 45, B: 20). (b) Ground truth map.
Figure 3: Botswana dataset. (a) False color image (band R: 60, G: 45, B: 15). (b) Ground truth map.
Figure 4: Pavia University dataset. (a) False color image (band R: 40, G: 30, B: 20). (b) Ground truth map.
Figure 5: Training patch size effect on the overall accuracy. (a) Houston 2013 dataset. (b) Botswana dataset. (c) Pavia University dataset.
...and 10 more figures
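
The Figure 1 caption notes that the SSA-mixer and CSA-mixer are related by a transpose on the sequence input. A minimal sketch of that idea follows, assuming single-head attention with identity query/key/value projections (the paper's mixers presumably use learned projections and multiple heads); the function names `self_attention`, `ssa_mixer`, and `csa_mixer` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head self-attention over the rows of x.
    # Identity Q/K/V projections are an assumption made for brevity.
    scale = np.sqrt(x.shape[-1])
    weights = softmax(x @ x.T / scale, axis=-1)  # (rows, rows)
    return weights @ x

def ssa_mixer(seq):
    # seq: (num_tokens, num_channels). Rows are spatial tokens,
    # so attention mixes information across spatial positions.
    return self_attention(seq)

def csa_mixer(seq):
    # Transpose so channels act as tokens, attend, then transpose back.
    return self_attention(seq.T).T

rng = np.random.default_rng(0)
seq = rng.standard_normal((9, 16))  # e.g. a 3x3 patch as 9 tokens with 16 channels
ssa_out = ssa_mixer(seq)
csa_out = csa_mixer(seq)
print(ssa_out.shape, csa_out.shape)
```

Both mixers preserve the input shape, which is what would let them be swapped inside the same block; only the axis along which attention weights are computed changes.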
