Lightweight RGB-T Tracking with Mobile Vision Transformers

Mahdi Falaki, Maria A. Amer

TL;DR

This work addresses robust RGB-T tracking under challenging conditions by introducing a lightweight tracker built on MobileViT with progressive intra- to inter-modal fusion via separable mixed attention. The architecture combines an mmMobileViT backbone, a PW-XCorr neck, a cross-modal fusion transformer, and a SMAT-style head. It is trained with a focal classification loss and a regression loss, and the resulting model has under 4M parameters and runs in real time (GPU ~122 FPS, CPU ~25.7 FPS). Key contributions include the first MobileViT-based multimodal tracker, a progressive fusion strategy that delays inter-modal interaction until deeper stages, and strong practical performance on resource-constrained devices. The approach yields a favorable accuracy–efficiency trade-off suitable for embedded and mobile deployments, with future work planned on token pruning and on extending to additional modalities such as RGB-D.
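To make the focal classification loss mentioned above concrete, here is a minimal sketch of the standard binary focal loss for a single prediction. This summary does not say which focal variant the tracker uses, so the function name `focal_loss` and the default values of `alpha` and `gamma` are illustrative assumptions, not the authors' settings.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0, eps=1e-12):
    """Standard binary focal loss for one probability p and label y in {0, 1}.

    alpha and gamma are illustrative defaults (assumption), not values
    taken from the paper.
    """
    # Probability assigned to the true class.
    p_t = p if y == 1 else 1.0 - p
    # Class-balancing weight for the true class.
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t)^gamma factor down-weights easy, well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))


# A confident correct prediction contributes far less than a confident wrong one.
easy = focal_loss(0.9, 1)
hard = focal_loss(0.1, 1)
```

With `gamma=0` the modulating factor disappears and the expression reduces to an `alpha`-weighted cross-entropy, which is a convenient sanity check.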

Abstract

Single-modality tracking (RGB-only) struggles under low illumination, adverse weather, and occlusion. Multimodal tracking addresses this by combining complementary cues. While Vision Transformer-based trackers achieve strong accuracy, they are often too large for real-time use. We propose a lightweight RGB-T tracker built on MobileViT with a progressive fusion framework that models intra- and inter-modal interactions using separable mixed attention. This design delivers compact, effective features for accurate localization, with under 4M parameters and real-time performance of 25.7 FPS on the CPU and 122 FPS on the GPU, supporting embedded and mobile platforms. To the best of our knowledge, this is the first MobileViT-based multimodal tracker. Model code and weights are available in the GitHub repository.

Paper Structure

This paper contains 13 sections, 4 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The pipeline of the proposed RGB-T tracker. MV2 stands for MobileNetV2 (Inverted Residual blocks) and mmMobileViT for multimodal MobileViT (see Figure 2). $\downarrow 2$ indicates spatial downsampling by 2. $X^{\mathrm{IR}}$ and $Z^{\mathrm{IR}}$ denote the input search and template frames of the thermal infrared (IR) modality. $\times 3$ shows the number of consecutive MV2 blocks in layer_2.
  • Figure 2: mmMobileViT: Layer 3 uses intra-modal separable mixed attention; Layer 4 uses inter-modal. $L$: transformer layers per block.
  • Figure 3: Tracking on two GTOT sequences comparing RGB-only (upper) and RGB-T (lower), as in Table 3. Red: predictions; green: ground truth. RainyCar2: rainy weather; WalkingOcc: partial occlusion.
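
The intra- versus inter-modal distinction in the Figure 2 caption can be illustrated with a small NumPy sketch. It assumes a MobileViTv2-style separable attention (linear in the number of tokens) and assumes that "mixed" means template and search tokens are concatenated before attention. The function names, tensor shapes, and the single shared weight set are hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np


def separable_attention(tokens, w_i, w_k, w_v, w_o):
    """MobileViTv2-style separable attention over (n, d) tokens (a sketch)."""
    scores = tokens @ w_i                       # (n, 1) one score per token
    scores = np.exp(scores - scores.max())
    scores = scores / scores.sum()              # softmax over the token axis
    # Single global context vector: score-weighted sum of projected keys.
    context = (scores * (tokens @ w_k)).sum(axis=0, keepdims=True)  # (1, d)
    # Broadcast the context onto ReLU-gated values, then project out.
    return (np.maximum(tokens @ w_v, 0.0) * context) @ w_o          # (n, d)


def intra_modal(z, x, weights):
    """Assumed intra-modal step: template z and search x of ONE modality."""
    return separable_attention(np.concatenate([z, x]), *weights)


def inter_modal(z_rgb, x_rgb, z_ir, x_ir, weights):
    """Assumed inter-modal step: tokens of BOTH modalities attend jointly."""
    return separable_attention(
        np.concatenate([z_rgb, x_rgb, z_ir, x_ir]), *weights
    )


rng = np.random.default_rng(0)
d = 8  # hypothetical channel width
z_rgb, x_rgb = rng.normal(size=(4, d)), rng.normal(size=(16, d))
z_ir, x_ir = rng.normal(size=(4, d)), rng.normal(size=(16, d))
weights = (
    rng.normal(size=(d, 1)),   # w_i: token scoring
    rng.normal(size=(d, d)),   # w_k
    rng.normal(size=(d, d)),   # w_v
    rng.normal(size=(d, d)),   # w_o
)

rgb_only = intra_modal(z_rgb, x_rgb, weights)            # shape (20, d)
fused = inter_modal(z_rgb, x_rgb, z_ir, x_ir, weights)   # shape (40, d)
```

In this sketch the RGB outputs of `intra_modal` cannot depend on the thermal tokens, while every output row of `inter_modal` shares a context vector computed from both modalities. That matches the idea of delaying cross-modal interaction to a deeper stage.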