Lightweight RGB-T Tracking with Mobile Vision Transformers
Mahdi Falaki, Maria A. Amer
TL;DR
This work addresses robust RGB-T tracking under challenging conditions by introducing a lightweight tracker built on MobileViT with progressive intra- to inter-modal fusion via separable mixed attention. The architecture combines an mmMobileViT backbone, a PW-XCorr neck, a cross-modal fusion transformer, and a SMAT-style head, and is trained with a focal classification loss and a regression loss. The resulting model has under 4M parameters and runs in real time (GPU ~122 FPS, CPU ~25.7 FPS). Key contributions include the first MobileViT-based multimodal tracker (to the authors' knowledge), a progressive fusion strategy that delays inter-modal interaction until deeper stages, and strong practical performance on resource-constrained devices. The approach yields a favorable accuracy-efficiency trade-off suitable for embedded and mobile deployments; future work includes token pruning and extension to additional modalities such as RGB-D.
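To make the role of the neck concrete, here is a minimal NumPy sketch of pixel-wise cross-correlation, assuming PW-XCorr denotes the operation as it is commonly defined in Siamese trackers: every spatial position of the template feature map acts as a 1x1 kernel over the search feature map. The function name `pixelwise_xcorr` and the tensor shapes are illustrative assumptions, not taken from the authors' code.

```python
import numpy as np


def pixelwise_xcorr(template, search):
    """Pixel-wise cross-correlation between two feature maps.

    template: (C, Hz, Wz) template features (hypothetical layout).
    search:   (C, Hx, Wx) search-region features.
    Returns:  (Hz * Wz, Hx, Wx); channel k holds the response of the
              k-th template pixel, used as a 1x1 kernel, over the search map.
    """
    c, hz, wz = template.shape
    c_search, hx, wx = search.shape
    if c != c_search:
        raise ValueError("template and search must share the channel dimension")

    kernels = template.reshape(c, hz * wz)  # one C-dim kernel per template pixel
    pixels = search.reshape(c, hx * wx)     # one C-dim vector per search pixel
    corr = kernels.T @ pixels               # (Hz*Wz, Hx*Wx) dot products
    return corr.reshape(hz * wz, hx, wx)
```

The output has one channel per template pixel, so a small template keeps the correlation map compact, which suits a lightweight design.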
Abstract
Single-modality (RGB-only) tracking struggles under low illumination, adverse weather, and occlusion. Multimodal tracking addresses this by combining complementary cues. While Vision Transformer-based trackers achieve strong accuracy, they are often too large for real-time use. We propose a lightweight RGB-T tracker built on MobileViT with a progressive fusion framework that models intra- and inter-modal interactions using separable mixed attention. This design delivers compact, effective features for accurate localization, with under 4M parameters and real-time performance of 25.7 FPS on the CPU and 122 FPS on the GPU, making it suitable for embedded and mobile platforms. To the best of our knowledge, this is the first MobileViT-based multimodal tracker. Model code and weights are available in the GitHub repository.
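As a rough illustration of inter-modal interaction with separable attention, the sketch below applies a MobileViTv2-style separable self-attention step to the concatenated RGB and thermal tokens. It is written under stated assumptions rather than as the authors' implementation: the function name, the weight matrices `w_i`, `w_k`, `w_v`, `w_o`, and the token layout are hypothetical, and the paper's exact block may differ.

```python
import numpy as np


def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def separable_mixed_attention(rgb_tokens, tir_tokens, w_i, w_k, w_v, w_o):
    """Separable attention over the concatenated RGB and thermal tokens.

    rgb_tokens: (N_rgb, d), tir_tokens: (N_tir, d)   (hypothetical layout)
    w_i: (d, 1) scoring projection; w_k, w_v, w_o: (d, d) projections.
    Cost is linear in the token count: no N x N attention matrix is formed.
    """
    tokens = np.concatenate([rgb_tokens, tir_tokens], axis=0)  # (N, d)

    # One scalar score per token, normalized over all tokens of both modalities.
    scores = softmax(tokens @ w_i, axis=0)                     # (N, 1)

    # Global context vector: score-weighted sum of the key projections.
    context = (scores * (tokens @ w_k)).sum(axis=0, keepdims=True)  # (1, d)

    # Broadcast the shared context onto every token's value projection.
    gated = np.maximum(tokens @ w_v, 0.0) * context            # (N, d)
    out = gated @ w_o

    n_rgb = rgb_tokens.shape[0]
    return out[:n_rgb], out[n_rgb:]
```

Because the context vector is pooled over both modalities, each RGB token's output depends on the thermal tokens and vice versa. Applying the same function to a single modality's tokens gives the intra-modal case.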
