SPECIAL: Zero-shot Hyperspectral Image Classification With CLIP

Li Pang, Jing Yao, Kaiyu Li, Jun Zhou, Deyu Meng, Xiangyong Cao

TL;DR

This work tackles zero-shot hyperspectral image (HSI) classification by introducing SPECIAL, a two-stage framework. The first stage spectrally interpolates the HSI to RGB bands and applies CLIP-based open-vocabulary segmentation to generate pseudo-labels, fusing predictions from multiple image resolutions to improve their quality. The second stage trains a spectral classifier (MambaHSI) on these noisy labels: a warmup phase is followed by a label-refinement phase that partitions samples into random, confident, and hard sets, with Gaussian Mixture Model-driven soft-label refinement to mitigate label noise. On three public HSI datasets (Pavia Centre, AeroRIT, and Chikusei), SPECIAL shows consistent improvements over existing CLIP-based baselines in overall accuracy (OA), average accuracy (AA), and the kappa coefficient ($\kappa$), supporting the value of combining full spectral information with probabilistic label refinement. The approach is modular and requires no manual annotations, which makes it a practical candidate for open-vocabulary hyperspectral interpretation.
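The three metrics named above have standard definitions: OA is the fraction of correctly labeled pixels, AA is the mean of the per-class recalls, and $\kappa$ is Cohen's kappa, which corrects OA for chance agreement. The sketch below computes them from flat label lists. It is an illustration only, not the paper's evaluation code; the function name `classification_metrics` and the toy labels are assumptions made for this example.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return (OA, AA, kappa) for two equal-length sequences of class labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    n = len(y_true)
    true_counts = Counter(y_true)   # pixels per ground-truth class
    pred_counts = Counter(y_pred)   # pixels per predicted class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)

    # Overall accuracy: correctly labeled pixels over all pixels.
    oa = sum(correct.values()) / n

    # Average accuracy: mean recall over the classes present in y_true.
    classes = sorted(true_counts)
    aa = sum(correct[c] / true_counts[c] for c in classes) / len(classes)

    # Cohen's kappa: agreement beyond what the class marginals predict by chance.
    all_classes = set(true_counts) | set(pred_counts)
    pe = sum(true_counts[c] * pred_counts[c] for c in all_classes) / (n * n)
    # Kappa is undefined when pe == 1 (a single class everywhere);
    # this sketch returns 0.0 in that degenerate case by convention.
    kappa = (oa - pe) / (1.0 - pe) if pe < 1.0 else 0.0
    return oa, aa, kappa


# Toy example with three classes (made-up labels, not data from the paper).
y_true = [0, 0, 0, 0, 1, 1, 2, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 2, 2, 0, 0]
oa, aa, kappa = classification_metrics(y_true, y_pred)
```

Because AA weights every class equally, it exposes poor performance on rare classes that OA can hide, which is why HSI results are usually reported with all three numbers.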

Abstract

Hyperspectral image (HSI) classification aims to assign each pixel in an HSI to a specific land cover class, which is crucial for applications such as remote sensing, environmental monitoring, and agriculture. Although deep learning-based HSI classification methods have achieved significant advancements, existing methods still rely on manually labeled training data, which is both time-consuming and labor-intensive to produce. To address this limitation, we introduce a novel zero-shot hyperspectral image classification framework based on CLIP (SPECIAL), aiming to eliminate the need for manual annotations. The SPECIAL framework consists of two main stages: (1) CLIP-based pseudo-label generation, and (2) noisy label learning. In the first stage, the HSI is spectrally interpolated to produce RGB bands. These bands are subsequently classified using CLIP, resulting in noisy pseudo-labels accompanied by confidence scores. To improve the quality of these labels, we propose a scaling strategy that fuses predictions from multiple spatial scales. In the second stage, spectral information and a label refinement technique are incorporated to mitigate label noise and further enhance classification accuracy. Experimental results on three benchmark datasets demonstrate that SPECIAL outperforms existing methods in zero-shot HSI classification, showing its potential for more practical applications. The code is available at https://github.com/LiPang/SPECIAL.
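The first stage depends on turning a cube with many spectral bands into three RGB bands that CLIP can consume. One simple way to picture this step is per-pixel linear interpolation of the spectrum at nominal red, green, and blue wavelengths. The sketch below does exactly that. It is an illustration under stated assumptions, not the authors' implementation: the target wavelengths in `RGB_NM`, the linear interpolation scheme, and the helper names `interp_band` and `hsi_to_rgb` are all chosen for this example.

```python
from bisect import bisect_left

# Assumed nominal centers (in nm) for the red, green, and blue channels.
RGB_NM = (650.0, 550.0, 450.0)


def interp_band(wavelengths, spectrum, target):
    """Linearly interpolate one pixel's spectrum at `target` nm.

    `wavelengths` must be strictly increasing; targets outside the sensor
    range are clamped to the nearest available band.
    """
    if target <= wavelengths[0]:
        return spectrum[0]
    if target >= wavelengths[-1]:
        return spectrum[-1]
    hi = bisect_left(wavelengths, target)  # first band at or above target
    lo = hi - 1
    w = (target - wavelengths[lo]) / (wavelengths[hi] - wavelengths[lo])
    return (1.0 - w) * spectrum[lo] + w * spectrum[hi]


def hsi_to_rgb(cube, wavelengths, rgb_nm=RGB_NM):
    """Map an H x W x B cube (nested lists) to an H x W x 3 RGB image."""
    return [
        [[interp_band(wavelengths, pixel, t) for t in rgb_nm] for pixel in row]
        for row in cube
    ]


# Toy 1 x 1 cube with four bands (made-up values).
wavelengths = [400.0, 500.0, 600.0, 700.0]
cube = [[[0.0, 1.0, 2.0, 3.0]]]
rgb = hsi_to_rgb(cube, wavelengths)
```

The resulting image would still need the usual scaling to the value range the CLIP preprocessing expects, which is omitted here.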

Paper Structure

This paper contains 18 sections, 11 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Overall framework of the proposed SPECIAL. The framework consists of two stages: CLIP-based pseudo-label generation (PLG) and noisy label learning (NLL). In the PLG stage, CLIP classifies the interpolated RGB bands, generating pseudo-labels with confidence scores. The NLL stage then improves classification accuracy by incorporating spectral information together with a label refinement strategy.
  • Figure 2: Comparison of prediction results at different image scales. (a) The original image. (b) Ground truth. (c) Prediction result without any image upsampling, where the model directly operates at the original resolution. (d) Prediction result with $2\times$ image upsampling before inference. The comparison shows that higher input resolution improves the detection of small objects such as individual cars and thin structures, while a lower resolution is more suitable for capturing large homogeneous areas such as continuous road regions. This indicates that different image scales emphasize complementary aspects of the scene.
  • Figure 3: Visualization of the classification maps produced by different approaches on the Pavia Centre dataset. (a) False-color image derived from the hyperspectral data. (b) Ground truth reference map. (c) CLIP. (d) MaskCLIP. (e) SCLIP. (f) GEM. (g) ClearCLIP. (h) SegEarth-OV. (i) The proposed method (Ours). The comparison shows that our method generates cleaner and more coherent segmentation maps, preserves object boundaries better, and reduces misclassifications in complex urban regions compared with existing CLIP-based or open-vocabulary baselines.
  • Figure 4: Visualization of the classification maps produced by different approaches on the AeroRIT dataset. (a) False-color image. (b) Ground truth. (c) CLIP. (d) MaskCLIP. (e) SCLIP. (f) GEM. (g) ClearCLIP. (h) SegEarth-OV. (i) Ours.
  • Figure 5: Visualization of the classification maps produced by different approaches on the Chikusei dataset. (a) False-color image. (b) Ground truth. (c) CLIP. (d) MaskCLIP. (e) SCLIP. (f) GEM. (g) ClearCLIP. (h) SegEarth-OV. (i) Ours.
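
The Figure 2 caption notes that predictions at the original resolution and at $2\times$ upsampling capture complementary structures, which motivates fusing them. The sketch below shows one plausible fusion rule: bring the $2\times$ class-probability map back to the original grid by averaging 2x2 blocks, average it with the original-scale map, and take the per-pixel argmax as the pseudo-label and the fused probability as its confidence. The averaging rule and the helper names `downsample_2x` and `fuse_scales` are assumptions for illustration; the paper's actual fusion strategy may differ.

```python
def downsample_2x(prob_map):
    """Average 2x2 blocks of a (2H x 2W x C) probability map to H x W x C."""
    h, w = len(prob_map) // 2, len(prob_map[0]) // 2
    n_classes = len(prob_map[0][0])
    out = []
    for i in range(h):
        row = []
        for j in range(w):
            block = [prob_map[2 * i + di][2 * j + dj]
                     for di in (0, 1) for dj in (0, 1)]
            row.append([sum(p[k] for p in block) / 4.0
                        for k in range(n_classes)])
        out.append(row)
    return out


def fuse_scales(prob_1x, prob_2x):
    """Fuse two scales by averaging; return (labels, confidences) per pixel."""
    prob_2x_down = downsample_2x(prob_2x)
    labels, confidences = [], []
    for row_a, row_b in zip(prob_1x, prob_2x_down):
        label_row, conf_row = [], []
        for pa, pb in zip(row_a, row_b):
            fused = [(a + b) / 2.0 for a, b in zip(pa, pb)]
            best = max(range(len(fused)), key=fused.__getitem__)
            label_row.append(best)        # pseudo-label: most probable class
            conf_row.append(fused[best])  # its fused probability
        labels.append(label_row)
        confidences.append(conf_row)
    return labels, confidences


# Toy example: one pixel, two classes (made-up probabilities).
prob_1x = [[[0.75, 0.25]]]                 # original scale favors class 0
prob_2x = [[[0.0, 1.0], [0.0, 1.0]],
           [[0.0, 1.0], [0.0, 1.0]]]       # 2x scale favors class 1
labels, confidences = fuse_scales(prob_1x, prob_2x)
```

In this toy case the two scales disagree, and the fused confidence ends up lower than either scale's own maximum; a noisy-label learner could use such a score to treat the pixel with more caution.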