Language-Informed Hyperspectral Image Synthesis for Imbalanced-Small Sample Classification via Semi-Supervised Conditional Diffusion Model

Yimin Zhu, Lincoln Linlin Xu

TL;DR

This work tackles imbalanced-small-sample hyperspectral image classification by introducing Txt2HSI-LDM(VAE), a language-informed, semi-supervised latent diffusion framework. Data are compressed into a low-dimensional latent space via a VAE, then generated and guided by text prompts through a Transformer-based latent diffusion model with visual–linguistic cross-attention, and finally decoded back to hyperspectral space for classification. The approach expands training data with text-conditioned synthetic samples, leverages unlabeled data, and employs RPSC and LF-UE to emulate spatial mixing, achieving state-of-the-art results on Indian Pines, Pavia University, and Houston 2018 datasets. The model reduces manual annotation effort while improving generalization and class balance, with promising implications for ISSD HSIC in remote sensing applications.

Abstract

Data augmentation effectively addresses the imbalanced-small sample data (ISSD) problem in hyperspectral image classification (HSIC). While most methods extend features in the latent space, few leverage text-driven generation to create realistic and diverse samples. Recently, text-guided diffusion models have gained significant attention in natural image synthesis for their ability to generate diverse, high-quality images from text prompts. Motivated by this, this paper proposes Txt2HSI-LDM(VAE), a novel language-informed hyperspectral image synthesis method to address the ISSD problem in HSIC. The proposed approach uses a denoising diffusion model, which iteratively removes Gaussian noise to generate hyperspectral samples conditioned on textual descriptions. First, to address the high dimensionality of hyperspectral data, a universal variational autoencoder (VAE) is designed to map the data into a low-dimensional latent space, which provides stable features and reduces the inference complexity of the diffusion model. Second, a semi-supervised diffusion model is designed to take full advantage of unlabeled data. Random polygon spatial clipping (RPSC) and uncertainty estimation of latent features (LF-UE) are used to simulate varying degrees of mixing. Third, the VAE decodes HSI from the latent space generated by the diffusion model with the language conditions as input. In our experiments, we evaluate the effectiveness of the synthetic samples in terms of statistical characteristics and data distribution in 2D-PCA space. Additionally, visual-linguistic cross-attention is visualized at the pixel level to show that the proposed model can capture the spatial layout and geometry of the generated data. Experiments demonstrate that the proposed Txt2HSI-LDM(VAE) outperforms classical backbone models, state-of-the-art CNNs, and semi-supervised methods.
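The abstract describes a diffusion model that iteratively removes Gaussian noise in the VAE latent space. As a rough illustration of the noise-adding side of that process, the sketch below samples a noised latent in closed form, assuming the standard DDPM formulation with a linear noise schedule. The schedule values, the number of steps, the flat 16-element latent, and the function names are all assumptions made for illustration, not details taken from the paper.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Per-step noise variances beta_t (hypothetical linear schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alpha(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(prod)
    return out


def q_sample(z0, t, alpha_bar, rng):
    """Draw z_t ~ N(sqrt(alpha_bar_t) * z0, (1 - alpha_bar_t) * I)."""
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    noise = [rng.gauss(0.0, 1.0) for _ in z0]
    z_t = [signal * z + sigma * e for z, e in zip(z0, noise)]
    return z_t, noise


rng = random.Random(0)
num_steps = 1000  # assumed number of diffusion steps
alpha_bar = cumulative_alpha(linear_beta_schedule(num_steps))

# Stand-in for a flattened VAE latent; a real latent would come from the encoder.
z0 = [rng.gauss(0.0, 1.0) for _ in range(16)]
z_T, _ = q_sample(z0, num_steps - 1, alpha_bar, rng)
```

Under this schedule the signal coefficient at the last step is close to zero, so `z_T` is almost entirely noise. A denoising network would be trained to predict `noise` from `z_t`, the step index, and the text condition.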

Paper Structure

This paper contains 47 sections, 19 equations, 17 figures, 11 tables, 2 algorithms.

Figures (17)

  • Figure 1: An overview of Txt2HSI-LDM(VAE). First, a VAE is trained on all cropped HSI patches, and the dimension-reduced data are used for the second stage. Second, a language-informed conditional diffusion model, Txt2HSI-LDM, is trained on limited labeled data and unlabeled data in a semi-supervised way to generate synthetic images from random language descriptions. Finally, the classifier [zhuym2023] is trained or fine-tuned on ISSD data expanded with the generated, labeled images. EMA denotes the exponential moving average used to transfer parameters from the base model to the ensemble model. The language encoder is initialized from the pre-trained CLIP weights 'ViT-B-32.pt' [CLIP2021] and is fine-tuned together with the diffusion model.
  • Figure 2: Illustration of the variational autoencoder (VAE), which consists of an encoder network, a decoder network, a reparameterization part, and a discriminator network. The detailed architectures of the encoder and decoder blocks are also depicted.
  • Figure 3: Latent diffusion forward and backward processes. $q(z_{t}|z_{t-1})$ and $p_{\theta}(z_{t-1}|z_{t})$ represent the noise-adding forward process and the denoising backward process, respectively. The essential problem is to estimate the conditional probability $q(z_{t-1}|z_{t})$. $z_T$ is nearly pure Gaussian noise.
  • Figure 4: Illustration of the Transformer Diffusion Network (top) and details of its blocks (bottom).
  • Figure 5: Illustration of the cross-attention.
  • ...and 12 more figures
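
The Figure 1 caption mentions an exponential moving average (EMA) that transfers parameters from the base model to the ensemble model. A minimal sketch of one such update follows, assuming the common per-parameter blending rule. The dictionary representation, the `ema_update` name, and the decay value are hypothetical and are not taken from the paper.

```python
def ema_update(ensemble_params, base_params, decay=0.999):
    """Blend base-model parameters into the ensemble (EMA) copy.

    Each ensemble value moves a fraction (1 - decay) toward the base value.
    """
    return {
        name: decay * ensemble_params[name] + (1.0 - decay) * base_params[name]
        for name in base_params
    }


# Toy scalar "parameters"; a real model would hold tensors per layer.
base = {"w": 0.0, "b": 2.0}
ensemble = {"w": 1.0, "b": 2.0}
ensemble = ema_update(ensemble, base, decay=0.9)
```

With a decay close to 1, the ensemble copy changes slowly and smooths out step-to-step fluctuations in the base model's parameters.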