Language-Informed Hyperspectral Image Synthesis for Imbalanced-Small Sample Classification via Semi-Supervised Conditional Diffusion Model
Yimin Zhu, Lincoln Linlin Xu
TL;DR
This work tackles hyperspectral image classification (HSIC) under imbalanced-small sample data (ISSD) by introducing Txt2HSI-LDM(VAE), a language-informed, semi-supervised latent diffusion framework. A VAE compresses the hyperspectral data into a low-dimensional latent space, a Transformer-based latent diffusion model with visual-linguistic cross-attention generates latent samples guided by text prompts, and the VAE decoder maps them back to hyperspectral space for classification. The approach expands the training set with text-conditioned synthetic samples, leverages unlabeled data, and uses random polygon spatial clipping (RPSC) and uncertainty estimation of latent features (LF-UE) to emulate varying degrees of spatial mixing. It achieves state-of-the-art results on the Indian Pines, Pavia University, and Houston 2018 datasets. By reducing manual annotation effort while improving generalization and class balance, the model has promising implications for ISSD HSIC in remote sensing applications.
Abstract
Data augmentation is an effective way to address the imbalanced-small sample data (ISSD) problem in hyperspectral image classification (HSIC). While most methods extend features in the latent space, few leverage text-driven generation to create realistic and diverse samples. Recently, text-guided diffusion models have gained significant attention in natural image synthesis for their ability to generate diverse, high-quality images from text prompts. Motivated by this, this paper proposes Txt2HSI-LDM(VAE), a novel language-informed hyperspectral image synthesis method that addresses the ISSD problem in HSIC. The proposed approach uses a denoising diffusion model, which iteratively removes Gaussian noise to generate hyperspectral samples conditioned on textual descriptions. First, to address the high dimensionality of hyperspectral data, a universal variational autoencoder (VAE) is designed to map the data into a low-dimensional latent space, which provides stable features and reduces the inference complexity of the diffusion model. Second, a semi-supervised diffusion model is designed to take full advantage of unlabeled data. Random polygon spatial clipping (RPSC) and uncertainty estimation of latent features (LF-UE) are used to simulate varying degrees of mixing. Third, the VAE decodes hyperspectral images from the latent representations that the diffusion model generates with language conditions as input. In our experiments, we evaluate the effectiveness of the synthetic samples in terms of their statistical characteristics and their data distribution in 2D-PCA space. Additionally, visual-linguistic cross-attention is visualized at the pixel level to show that the proposed model can capture the spatial layout and geometry of the generated data. Experiments demonstrate that the proposed Txt2HSI-LDM(VAE) outperforms classical backbone models, state-of-the-art CNNs, and semi-supervised methods.
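To make the phrase "iteratively removes Gaussian noise" concrete, the following is a minimal sketch of the standard closed-form noising step that denoising diffusion models are trained to invert, applied to a stand-in latent tensor. It is not the Txt2HSI-LDM(VAE) implementation: the linear noise schedule, the number of steps, the latent shape, and the function names `make_schedule`, `add_noise`, and `predict_z0` are all assumptions chosen for illustration, and the text conditioning and the learned noise predictor are omitted.

```python
import numpy as np


def make_schedule(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(z0, t, alpha_bars, rng):
    """Sample z_t from q(z_t | z_0) in closed form; returns z_t and the noise used."""
    eps = rng.standard_normal(z0.shape)
    a = alpha_bars[t]
    zt = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    return zt, eps


def predict_z0(zt, eps_hat, t, alpha_bars):
    """Invert the noising step given a noise estimate eps_hat."""
    a = alpha_bars[t]
    return (zt - np.sqrt(1.0 - a) * eps_hat) / np.sqrt(a)


rng = np.random.default_rng(0)
alpha_bars = make_schedule()

# Hypothetical latent: 8 channels over a 16x16 spatial patch.
z0 = rng.standard_normal((8, 16, 16))

# Noise the latent at an intermediate step, then invert using the true noise.
# In a trained model, a network conditioned on the text prompt would supply
# eps_hat instead of the ground-truth eps used here.
zt, eps = add_noise(z0, 500, alpha_bars, rng)
z0_rec = predict_z0(zt, eps, 500, alpha_bars)
```

With the true noise supplied, `predict_z0` recovers the clean latent up to floating-point error, which is why training a network to predict the noise is sufficient for generation. Working in a low-dimensional latent rather than on the full spectral cube keeps each such step cheap, which is the motivation the abstract gives for the VAE.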
