LoLA-SpecViT: Local Attention SwiGLU Vision Transformer with LoRA for Hyperspectral Imaging

Fadi Abdeladhim Zidi, Djamel Eddine Boukhari, Abdellah Zakaria Sellam, Abdelkrim Ouafi, Cosimo Distante, Salah Eddine Bekhouche, Abdelmalik Taleb-Ahmed

TL;DR

LoLA-SpecViT tackles hyperspectral image classification under severe label scarcity by pairing a lightweight 3D spectral front-end with a hierarchical, LoRA-enabled Transformer backbone. The approach introduces BandDropout, a spectral attention module, SwiGLU activations, and a cyclical scheduler for LoRA adaptation strength to deliver high accuracy with far fewer trainable parameters. Across three benchmark datasets, it achieves state-of-the-art results with only 2–10% of the data labeled and produces classification maps with sharp boundaries and low noise. The work offers a scalable, generalizable solution for real-world remote sensing applications in agriculture and environmental monitoring, with code available for reproducibility.
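As a point of reference for the SwiGLU activations mentioned above, the following is a minimal NumPy sketch of a standard SwiGLU feed-forward block, in which a Swish-gated branch multiplies a linear branch before the output projection. The function names, weight names (`W_gate`, `W_up`, `W_down`), and dimensions are illustrative assumptions and are not taken from the LoLA-SpecViT code.

```python
import numpy as np


def swish(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))


def swiglu_ffn(x, W_gate, W_up, W_down):
    """Standard SwiGLU feed-forward block (hypothetical weight names).

    x:      (tokens, d_model)
    W_gate: (d_model, d_hidden)  gating branch, passed through Swish
    W_up:   (d_model, d_hidden)  linear branch
    W_down: (d_hidden, d_model)  output projection
    """
    gated = swish(x @ W_gate) * (x @ W_up)  # element-wise gating
    return gated @ W_down


# Illustrative sizes only; the paper's actual dimensions are not assumed here.
rng = np.random.default_rng(0)
d_model, d_hidden = 16, 32
W_gate = rng.standard_normal((d_model, d_hidden)) * 0.1
W_up = rng.standard_normal((d_model, d_hidden)) * 0.1
W_down = rng.standard_normal((d_hidden, d_model)) * 0.1

tokens = rng.standard_normal((5, d_model))
out = swiglu_ffn(tokens, W_gate, W_up, W_down)
zero_out = swiglu_ffn(np.zeros((5, d_model)), W_gate, W_up, W_down)
```

The gate lets the block suppress or pass individual hidden features per token, which is the usual motivation for preferring SwiGLU over a plain GELU or ReLU feed-forward layer.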

Abstract

Hyperspectral image classification remains a challenging task due to the high dimensionality of spectral data, significant inter-band redundancy, and the limited availability of annotated samples. While recent transformer-based models have improved the global modeling of spectral-spatial dependencies, their scalability and adaptability under label-scarce conditions remain limited. In this work, we propose LoLA-SpecViT (Low-rank adaptation Local Attention Spectral Vision Transformer), a lightweight spectral vision transformer that addresses these limitations through a parameter-efficient architecture tailored to the unique characteristics of hyperspectral imagery. Our model combines a 3D convolutional spectral front-end with local window-based self-attention, enhancing both spectral feature extraction and spatial consistency while reducing computational complexity. To further improve adaptability, we integrate low-rank adaptation (LoRA) into attention and projection layers, enabling fine-tuning with over 80% fewer trainable parameters. A novel cyclical learning rate scheduler modulates LoRA adaptation strength during training, improving convergence and generalization. Extensive experiments on three benchmark datasets (WHU-Hi LongKou, WHU-Hi HongHu, and Salinas) demonstrate that LoLA-SpecViT consistently outperforms state-of-the-art baselines, achieving up to 99.91% accuracy with substantially fewer parameters and enhanced robustness under low-label regimes. The proposed framework provides a scalable and generalizable solution for real-world HSI applications in agriculture, environmental monitoring, and remote sensing analytics. Our code is available at https://github.com/FadiZidiDz/LoLA-SpecViT.
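To illustrate how low-rank adaptation reduces the trainable parameter count of a projection layer, here is a minimal NumPy sketch of a generic LoRA linear layer. It is a sketch under stated assumptions, not the authors' implementation: the class name `LoRALinear`, the rank, the scaling `alpha`, and the layer width are all hypothetical, and the resulting ratio is not meant to reproduce the figure reported in the abstract.

```python
import numpy as np


class LoRALinear:
    """Frozen dense weight W plus a trainable low-rank update (alpha / r) * B @ A.

    Hypothetical illustration of generic LoRA; names and sizes are assumptions.
    """

    def __init__(self, d_in, d_out, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        # Base weight: kept frozen during fine-tuning.
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)
        # Low-rank factors: the only trainable parameters.
        self.A = rng.standard_normal((rank, d_in)) * 0.01
        self.B = np.zeros((d_out, rank))  # zero init, so the update starts at 0
        self.scale = alpha / rank

    def __call__(self, x):
        # x: (batch, d_in) -> (batch, d_out)
        base = x @ self.W.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def trainable_params(self):
        return self.A.size + self.B.size


# Illustrative width and rank only.
layer = LoRALinear(d_in=96, d_out=96, rank=4)
x = np.ones((2, 96))
y = layer(x)

# Fraction of the dense layer's parameters that remain trainable.
trainable_fraction = layer.trainable_params() / layer.W.size
```

With these illustrative sizes the layer trains 768 parameters against 9,216 in the frozen dense weight. Because `B` starts at zero, the adapted layer initially reproduces the frozen layer's output exactly, and a schedule that modulates adaptation strength could act on `scale`.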

Paper Structure

This paper contains 41 sections, 36 equations, 3 figures, 10 tables, 3 algorithms.

Figures (3)

  • Figure 1: Architecture of the proposed LoLA-SpecViT for hyperspectral image classification.
  • Figure 2: Visualization of hyperspectral data cubes and corresponding ground-truth classification maps for the WHU-Hi-HongHu, WHU-Hi-LongKou, and Salinas datasets: (a) hyperspectral image cube; (b) ground-truth map.
  • Figure 3: Visual classification results on three HSI benchmarks: (I) Salinas dataset; (II) WHU-Hi LongKou dataset; (III) WHU-Hi HongHu dataset.