VP-Hype: A Hybrid Mamba-Transformer Framework with Visual-Textual Prompting for Hyperspectral Image Classification

Abdellah Zakaria Sellam, Fadi Abdeladhim Zidi, Salah Eddine Bekhouche, Ihssen Houhou, Marouane Tliba, Cosimo Distante, Abdenour Hadid

TL;DR

VP-Hype is a framework for HSI classification that unifies the linear-time efficiency of State-Space Models (SSMs) with the relational modeling of Transformers in a hybrid architecture, and addresses label scarcity by integrating dual-modal Visual and Textual Prompts.

Abstract

Accurate classification of hyperspectral imagery (HSI) is often frustrated by the tension between high-dimensional spectral data and the extreme scarcity of labeled training samples. While hierarchical models like LoLA-SpecViT have demonstrated the power of local windowed attention and parameter-efficient fine-tuning, the quadratic complexity of standard Transformers remains a barrier to scaling. We introduce VP-Hype, a framework that rethinks HSI classification by unifying the linear-time efficiency of State-Space Models (SSMs) with the relational modeling of Transformers in a novel hybrid architecture. Building on a robust 3D-CNN spectral front-end, VP-Hype replaces conventional attention blocks with a Hybrid Mamba-Transformer backbone to capture long-range dependencies with significantly reduced computational overhead. Furthermore, we address the label-scarcity problem by integrating dual-modal Visual and Textual Prompts that provide context-aware guidance for the feature extraction process. Our experimental evaluation demonstrates that VP-Hype establishes a new state of the art in low-data regimes. Specifically, with a training sample distribution of only 2%, the model achieves an Overall Accuracy (OA) of 99.69% on the Salinas dataset and 99.45% on the Longkou dataset. These results suggest that the convergence of hybrid sequence modeling and multi-modal prompting provides a robust path forward for high-performance, sample-efficient remote sensing.
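To make the 2% low-data regime concrete, the following is a minimal sketch of a per-class (stratified) split of labeled pixels. The exact sampling protocol used for VP-Hype is not stated in this summary, so the function name `stratified_split`, the background label `0`, the rounding rule, and the toy label vector are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.02, seed=0, ignore_label=0):
    """Split labeled pixel indices so each class contributes ~train_fraction to training.

    labels: flat sequence of integer class labels, one per pixel.
    ignore_label: label of unlabeled background pixels (assumed to be 0 here).
    """
    rng = random.Random(seed)

    # Group pixel indices by class, skipping unlabeled background.
    by_class = defaultdict(list)
    for idx, lab in enumerate(labels):
        if lab != ignore_label:
            by_class[lab].append(idx)

    train_idx, test_idx = [], []
    for lab in sorted(by_class):
        idxs = by_class[lab]
        rng.shuffle(idxs)
        # Keep at least one training sample per class (an assumption of this sketch).
        n_train = max(1, math.ceil(train_fraction * len(idxs)))
        train_idx.extend(idxs[:n_train])
        test_idx.extend(idxs[n_train:])
    return sorted(train_idx), sorted(test_idx)


# Toy label vector: 50 background pixels and three classes of different sizes.
labels = [0] * 50 + [1] * 500 + [2] * 120 + [3] * 30
train_idx, test_idx = stratified_split(labels)
```

Sampling per class rather than globally keeps rare classes represented in the training set, which matters when only a few dozen labeled pixels exist for some categories.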

Paper Structure (35 sections, 22 equations, 4 figures, 11 tables, 1 algorithm)

Figures (4)

  • Figure 1: Comparison of performance metrics across nine hyperspectral image (HSI) classification models. All scores are normalized relative to the proposed VP-Hype, which demonstrates superior performance across all datasets (IP, UP, SA, LK, HH, HC), aggregate accuracy metrics (AA, Kappa), and computational efficiency.
  • Figure 2: Architecture of the proposed VP-Hype framework. The model combines a hybrid Mamba–Transformer backbone with visual–textual prompting for hyperspectral image classification. Task-specific text prompts encoded by CLIP and learnable visual prompts are fused via Text Conditional Spatial Prompt (TCSP) blocks and injected at multiple network stages. The prompt-enhanced features are progressively downsampled and finally passed to a classification head for prediction.
  • Figure 3: Visualization of hyperspectral data cubes and corresponding ground-truth classification maps for the WHU-Hi-HongHu, WHU-Hi-LongKou, and Salinas datasets: (a) hyperspectral image cube; (b) ground-truth map.
  • Figure 4: Comparative classification performance on standard hyperspectral benchmarks: (I) Salinas, (II) WHU-Hi-LongKou, and (III) WHU-Hi-HongHu datasets. Ground truth and method predictions are presented with challenging regions highlighted (red boxes).
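
The metrics referenced above (OA, AA, Kappa) follow standard definitions based on the confusion matrix. The sketch below shows one way to compute them in plain Python; the function name `classification_metrics` and the toy labels are illustrative assumptions and are not taken from the paper's code.

```python
def classification_metrics(y_true, y_pred, num_classes):
    """Return (OA, AA, kappa) from integer labels in range(num_classes)."""
    # Confusion matrix: rows are true classes, columns are predicted classes.
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1

    total = sum(sum(row) for row in cm)
    row_sums = [sum(row) for row in cm]
    col_sums = [sum(cm[r][c] for r in range(num_classes)) for c in range(num_classes)]

    # Overall Accuracy: fraction of all samples classified correctly.
    oa = sum(cm[k][k] for k in range(num_classes)) / total

    # Average Accuracy: mean of per-class recall over classes that occur.
    recalls = [cm[k][k] / row_sums[k] for k in range(num_classes) if row_sums[k] > 0]
    aa = sum(recalls) / len(recalls)

    # Cohen's kappa: agreement corrected for chance agreement pe.
    pe = sum(row_sums[k] * col_sums[k] for k in range(num_classes)) / (total * total)
    kappa = (oa - pe) / (1 - pe) if pe < 1 else 1.0
    return oa, aa, kappa


# Toy example with three classes and ten samples.
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 2, 2, 0]
oa, aa, kappa = classification_metrics(y_true, y_pred, num_classes=3)
```

OA weights every pixel equally, so it is dominated by large classes, whereas AA weights every class equally; reporting both alongside Kappa gives a fuller picture on imbalanced HSI benchmarks.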