Label Semantics for Robust Hyperspectral Image Classification

Rafin Hassan, Zarin Tasnim Roshni, Rafiqul Bari, Alimul Islam, Nabeel Mohammed, Moshiur Farazi, Shafin Rahman

TL;DR

This work tackles the challenge of hyperspectral image classification under limited labeled data by introducing S3FN, a two-stage framework that fuses spectral–spatial features from a 3D-CNN with semantic label embeddings derived from rich, LLM-generated class descriptions. LLM prompts produce descriptive texts for each class, which are encoded by transformer-based text encoders to form semantic embeddings that guide alignment with image features. The architecture enables robust feature–label alignment and leverages patch-level voting to improve image predictions, with experiments on wood, blueberries, and DeepHS-Fruit datasets showing performance gains and insights into encoder choices. Overall, the approach demonstrates that contextual linguistic information can meaningfully augment hyperspectral classification, offering better generalization across diverse domains and paving the way for semantically guided HSI analysis in agriculture, environment, and beyond.

Abstract

Hyperspectral imaging (HSI) classification is a critical tool with widespread applications across diverse fields such as agriculture, environmental monitoring, medicine, and materials science. Due to the limited availability of high-quality training samples and the high dimensionality of spectral data, HSI classification models are prone to overfitting and often face challenges in balancing accuracy and computational complexity. Furthermore, most HSI classification models are monomodal: they rely solely on spectral-spatial data to learn decision boundaries in the high-dimensional embedding space. To address this, we propose a general-purpose Semantic Spectral-Spatial Fusion Network (S3FN) that uses contextual, class-specific textual descriptions to complement the training of an HSI classification model. Specifically, S3FN leverages LLMs to generate comprehensive textual descriptions for each class label that capture its unique characteristics and spectral behaviors. These descriptions are then embedded into a vector space using a pre-trained text encoder such as BERT or RoBERTa to extract meaningful label semantics, which in turn leads to better feature-label alignment and improved classification performance. To demonstrate the effectiveness of our approach, we evaluate our model on three diverse HSI benchmark datasets (Hyperspectral Wood, HyperspectralBlueberries, and DeepHS-Fruit) and report a significant performance boost. Our results highlight the synergy between textual semantics and spectral-spatial data, paving the way for further advancements in semantically augmented HSI classification models. Code is available at: https://github.com/milab-nsu/S3FN

Paper Structure

This paper contains 12 sections, 11 equations, 3 figures, 5 tables.

Figures (3)

  • Figure 1: Quality assessment (Good/Defective) of blueberries using hyperspectral imaging (HSI). (a) A Good blueberry exhibits high reflectance in the near-infrared (NIR) region, with characteristic dips corresponding to water, chlorophyll, and anthocyanins at specific wavelengths. (b) A Defective blueberry shows altered spectral patterns due to cellular structure changes, surface texture variations, and biochemical shifts. The generated patches preserve key regions of the hyperspectral images, so the LLM-generated descriptions of these differences remain relevant to each patch.
  • Figure 2: Proposed S3FN Architecture. The Semantic Spectral-Spatial Fusion Network (S3FN) integrates spectral-spatial feature extraction with semantically enriched label embeddings for robust hyperspectral image (HSI) classification through four key stages: (a) For each class $y \in \mathcal{Y}$, comprehensive textual descriptions $T_y$ generated via LLM prompts (as shown in Table \ref{tab:description_table}) are encoded into semantic label embeddings $\mathbf{E} = \{\mathbf{e}_y \in \mathbb{R}^d \mid y \in \mathcal{Y}\}$ using pre-trained text encoders (e.g., BERT or RoBERTa). (b) Each HSI image $\mathbf{X}_i \in \mathbb{R}^{H \times W \times C}$ is partitioned into $M$ non-overlapping patches $\mathbf{P}_{i,j} \in \mathbb{R}^{32 \times 32 \times C}$, followed by global PCA to reduce spectral dimensionality to $C' \ll C$, yielding compressed patches $\mathbf{P}'_{i,j} \in \mathbb{R}^{32 \times 32 \times C'}$. A pre-trained 3D CNN $f_\theta: \mathbb{R}^{32 \times 32 \times C'} \rightarrow \mathbb{R}^{d'}$, as shown in (d), extracts spectral-spatial features $\mathbf{z}_{i,j} = f_\theta(\mathbf{P}'_{i,j})$, which are projected to dimension $d$ via a multilayer perceptron (MLP), producing aligned embeddings $\mathbf{z}'_{i,j} \in \mathbb{R}^d$. (c) Semantic alignment computes similarity scores $s^y_{i,j} = \mathbf{z}'_{i,j} \cdot \mathbf{e}_y$ for each class $y$, normalizes them into class probabilities $p(y \mid \mathbf{P}'_{i,j})$ via softmax, and aggregates patch-level predictions through majority voting to determine the final class label for $\mathbf{X}_i$.
  • Figure 3: Mean spectral reflectance curves for two unique samples of wood (left: heartwood and sapwood) and blueberries (right: healthy/good and bad/defective), illustrating key absorption and reflectance features. The spectral mean reflectance curve for each sample was computed as described in Section \ref{subsec:Proposed_Architecture}, averaging reflectance values across all pixels for each spectral band. The overlaid semantic descriptions, generated by an LLM, capture class-specific spectral characteristics (highlighted in color-coded regions), enhancing label embeddings for robust hyperspectral image classification. For example, Heartwood typically displays lower reflectance, particularly in the blue (450 nm) region, due to the presence of phenolic compounds.
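
The patching, alignment, and voting steps in the Figure 2 caption, and the per-band averaging mentioned in the Figure 3 caption, can be illustrated with a small NumPy sketch. This is not the authors' implementation. The function names (extract_patches, mean_spectrum, softmax, classify_image) are hypothetical, the global PCA step is omitted, and the 3D CNN and MLP are not modeled: the projected patch embeddings and the label embeddings are simply passed in as arrays. Dropping edge pixels that do not fill a whole patch and breaking voting ties by first occurrence are also assumptions.

```python
from collections import Counter

import numpy as np


def extract_patches(cube, patch=32):
    """Split an (H, W, C) cube into non-overlapping (patch, patch, C) tiles.

    Assumption: rows/columns that do not fill a whole tile are dropped.
    """
    H, W, C = cube.shape
    rows, cols = H // patch, W // patch
    cube = cube[: rows * patch, : cols * patch]
    tiles = cube.reshape(rows, patch, cols, patch, C).transpose(0, 2, 1, 3, 4)
    return tiles.reshape(rows * cols, patch, patch, C)


def mean_spectrum(cube):
    """Average reflectance over all pixels, giving one value per band."""
    return cube.reshape(-1, cube.shape[-1]).mean(axis=0)


def softmax(scores):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def classify_image(patch_embeddings, label_embeddings):
    """Predict one label for an image from its M patch embeddings.

    patch_embeddings: (M, d) array standing in for the projected features z'.
    label_embeddings: (K, d) array standing in for the label embeddings e_y.
    """
    scores = patch_embeddings @ label_embeddings.T  # dot products, shape (M, K)
    probs = softmax(scores)                         # p(y | patch), shape (M, K)
    patch_preds = probs.argmax(axis=1)              # most likely class per patch
    votes = Counter(patch_preds.tolist())
    # Majority vote; ties go to the class encountered first (an assumption).
    label = votes.most_common(1)[0][0]
    return label, probs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cube = rng.random((64, 96, 10))      # toy cube: H=64, W=96, C=10
    patches = extract_patches(cube)      # 6 tiles of shape (32, 32, 10)
    z = rng.normal(size=(len(patches), 8))  # placeholder patch embeddings
    e = rng.normal(size=(2, 8))             # placeholder label embeddings
    label, probs = classify_image(z, e)
    print(patches.shape, mean_spectrum(cube).shape, label)
```

In this sketch each patch votes with equal weight regardless of how confident its softmax distribution is, which matches a plain majority vote; a confidence-weighted vote would be a different design choice that the captions do not describe.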