Spectral-Enhanced Transformers: Leveraging Large-Scale Pretrained Models for Hyperspectral Object Tracking
Shaheer Mohamed, Tharindu Fernando, Sridha Sridharan, Peyman Moghadam, Clinton Fookes
TL;DR
This work tackles hyperspectral object tracking with limited data by adapting large RGB pretrained transformers to hyperspectral inputs. It introduces an adaptive spatial-spectral token fusion that merges spatial (false-color) and spectral (hyperspectral) information, and employs cross-modality training to learn modality-invariant features across different sensors. The approach achieves state-of-the-art results on the HOT2020 and HOT2024 datasets with only a few training epochs, demonstrating the practical viability of leveraging foundation models in hyperspectral tracking. Overall, the method offers a scalable pathway to exploit large pretrained transformers for hyperspectral tasks, enabling robust tracking across diverse sensor modalities.
Abstract
Hyperspectral object tracking with snapshot mosaic cameras is an emerging field: these cameras capture enhanced spectral information alongside spatial data, which supports a more comprehensive understanding of material properties. Transformers, which have consistently outperformed convolutional neural networks (CNNs) at learning feature representations, would be expected to be effective for hyperspectral object tracking. However, training large transformers requires extensive datasets and long training periods. This is particularly critical for complex tasks such as object tracking, and the scarcity of large datasets in the hyperspectral domain is a bottleneck to realizing the full potential of powerful transformer models. This paper proposes an effective methodology that adapts large pretrained transformer-based foundation models for hyperspectral object tracking. We propose an adaptive, learnable spatial-spectral token fusion module that can be extended to any transformer-based backbone for learning the inherent spatial-spectral features in hyperspectral data. Furthermore, our model incorporates a cross-modality training pipeline that facilitates effective learning across hyperspectral datasets collected with different sensor modalities. This enables the extraction of complementary knowledge from additional modalities, whether or not they are present during testing. Our proposed model also achieves superior performance with minimal training iterations.
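To make the idea of a learnable spatial-spectral token fusion more concrete, the following is a minimal sketch under stated assumptions, not the authors' module. The abstract does not specify how the fusion is computed, so the sketch substitutes one common choice: a per-feature sigmoid gate that blends a spatial token (assumed to come from the false-color image) with the spectral token at the same position (assumed to come from the hyperspectral bands). The class name `GatedTokenFusion`, its parameters, and the token shapes are all hypothetical, and NumPy is used only to keep the example self-contained.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


class GatedTokenFusion:
    """Hypothetical gated fusion of spatial and spectral tokens.

    Assumption: both token sequences have shape (num_tokens, dim) and are
    aligned position by position. This is an illustrative stand-in, not the
    fusion module described in the paper.
    """

    def __init__(self, dim, seed=0):
        rng = np.random.default_rng(seed)
        # Parameters of the gate; in a real model these would be trained.
        self.weight = rng.normal(0.0, 0.02, size=(2 * dim, dim))
        self.bias = np.zeros(dim)

    def __call__(self, spatial_tokens, spectral_tokens):
        if spatial_tokens.shape != spectral_tokens.shape:
            raise ValueError("token sequences must have identical shapes")
        # The gate sees both modalities for each token position.
        pair = np.concatenate([spatial_tokens, spectral_tokens], axis=-1)
        gate = sigmoid(pair @ self.weight + self.bias)
        # Convex combination: gate near 1 favours spatial, near 0 spectral.
        return gate * spatial_tokens + (1.0 - gate) * spectral_tokens


# Usage with random stand-in tokens (4 tokens, 8 features each).
rng = np.random.default_rng(1)
spatial = rng.normal(size=(4, 8))
spectral = rng.normal(size=(4, 8))
fused = GatedTokenFusion(dim=8)(spatial, spectral)
```

Because the output has the same shape as either input, a module of this kind could in principle sit in front of a pretrained transformer backbone without changing the backbone's token dimension, which is one way to read the abstract's claim that the fusion extends to any transformer-based backbone.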
