LoLA-SpecViT: Local Attention SwiGLU Vision Transformer with LoRA for Hyperspectral Imaging

Fadi Abdeladhim Zidi; Djamel Eddine Boukhari; Abdellah Zakaria Sellam; Abdelkrim Ouafi; Cosimo Distante; Salah Eddine Bekhouche; Abdelmalik Taleb-Ahmed

TL;DR

LoLA-SpecViT tackles hyperspectral image classification under severe label scarcity by combining a lightweight 3D spectral front-end with a hierarchical, LoRA-enabled Transformer backbone. The approach introduces BandDropout, a spectral attention module, SwiGLU activations, and a cyclical scheduler that modulates LoRA adaptation strength, delivering high accuracy with far fewer trainable parameters. Across three benchmark datasets, it achieves state-of-the-art results with only 2–10% labeled data and produces classification maps with sharp boundaries and low noise. The work offers a scalable, generalizable solution for real-world remote sensing applications in agriculture and environmental monitoring, with code available for reproducibility.

Abstract

Hyperspectral image classification remains a challenging task due to the high dimensionality of spectral data, significant inter-band redundancy, and the limited availability of annotated samples. While recent transformer-based models have improved the global modeling of spectral-spatial dependencies, their scalability and adaptability under label-scarce conditions remain limited. In this work, we propose LoLA-SpecViT (Low-rank adaptation Local Attention Spectral Vision Transformer), a lightweight spectral vision transformer that addresses these limitations through a parameter-efficient architecture tailored to the unique characteristics of hyperspectral imagery. Our model combines a 3D convolutional spectral front-end with local window-based self-attention, enhancing both spectral feature extraction and spatial consistency while reducing computational complexity. To further improve adaptability, we integrate low-rank adaptation (LoRA) into attention and projection layers, enabling fine-tuning with over 80% fewer trainable parameters. A novel cyclical learning rate scheduler modulates LoRA adaptation strength during training, improving convergence and generalization. Extensive experiments on three benchmark datasets (WHU-Hi LongKou, WHU-Hi HongHu, and Salinas) demonstrate that LoLA-SpecViT consistently outperforms state-of-the-art baselines, achieving up to 99.91% accuracy with substantially fewer parameters and enhanced robustness under low-label regimes. The proposed framework provides a scalable and generalizable solution for real-world HSI applications in agriculture, environmental monitoring, and remote sensing analytics. Our code is available at https://github.com/FadiZidiDz/LoLA-SpecViT.
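To make the parameter-efficiency claim concrete, the following is a minimal pure-Python sketch of a LoRA-style layer whose low-rank update is scaled by a cyclical factor. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the schedule shape, the LoRA rank, or the layer sizes, so the cosine cycle, the function names (`lora_trainable_params`, `cyclical_lora_scale`, `lora_forward`), and every numeric value below are hypothetical.

```python
import math


def lora_trainable_params(d_in, d_out, rank):
    """Trainable parameters in the two low-rank factors:
    A has shape (rank x d_in) and B has shape (d_out x rank)."""
    return rank * (d_in + d_out)


def cyclical_lora_scale(step, period, s_min, s_max):
    """ASSUMED schedule shape: a cosine cycle that starts each period
    at s_min and peaks at s_max halfway through the period."""
    phase = (step % period) / period
    return s_min + 0.5 * (s_max - s_min) * (1.0 - math.cos(2.0 * math.pi * phase))


def matvec(matrix, vector):
    """Matrix-vector product on plain Python lists."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def lora_forward(x, W, A, B, scale):
    """y = W x + scale * B (A x). W is the frozen base weight;
    only A and B would be updated during fine-tuning."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + scale * d for b, d in zip(base, delta)]


# Illustrative sizes only (not taken from the paper).
d_in, d_out, rank = 256, 256, 8
full_params = d_in * d_out
lora_params = lora_trainable_params(d_in, d_out, rank)
reduction = 1.0 - lora_params / full_params

# Tiny forward pass with a rank-1 update on a 2x2 identity weight.
x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]        # rank x d_in
B = [[1.0], [0.0]]      # d_out x rank
y = lora_forward(x, W, A, B, scale=cyclical_lora_scale(0, 100, 0.5, 2.0))
```

For these illustrative sizes, a single 256 x 256 projection trains 4,096 values instead of 65,536. The saving for a whole model depends on which layers receive adapters and on the chosen rank, so this number should not be read as the source of the paper's reported reduction.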

Paper Structure

Table of Contents

Figures (3)