LightFC-X: Lightweight Convolutional Tracker for RGB-X Tracking

Yunfeng Li, Bo Wang, Ye Li

TL;DR

LightFC-X tackles the challenge of resource-intensive multimodal tracking by introducing a unified, lightweight RGB-X tracker built on two key modules: an Efficient Cross-Attention Module (ECAM) for cross-modal integration and a Spatiotemporal Template Aggregation Module (STAM) for temporal refinement, trained with a two-phase strategy. The approach preserves accuracy while substantially reducing parameters and increasing speed, achieving state-of-the-art results on multiple benchmarks (including RGB-T, RGB-D, RGB-E, and RGB-S) and running in real time on a CPU. The main contributions are the 0.08M-parameter ECAM, the STAM-based temporal aggregation with a module fine-tuning paradigm, and the demonstration of practical deployment potential on edge devices. This work advances lightweight multimodal tracking by enabling efficient cross-modal feature interaction and temporal modeling without sacrificing performance.

Abstract

Despite great progress in multimodal tracking, existing trackers remain too heavy and expensive for resource-constrained devices. To alleviate this problem, we propose LightFC-X, a family of lightweight convolutional RGB-X trackers that explores a unified convolutional architecture for lightweight multimodal tracking. Our core idea is to achieve lightweight cross-modal modeling and joint refinement of the multimodal features and the spatiotemporal appearance features of the target. Specifically, we propose a novel efficient cross-attention module (ECAM) and a novel spatiotemporal template aggregation module (STAM). The ECAM achieves lightweight cross-modal interaction of template-search area integrated features with only 0.08M parameters. The STAM enhances the model's utilization of temporal information through a module fine-tuning paradigm. Comprehensive experiments show that LightFC-X achieves state-of-the-art performance and the optimal balance between parameters, performance, and speed. For example, LightFC-T-ST outperforms CMD by 4.3% and 5.7% in SR and PR on the LasHeR benchmark, while achieving a 2.6x reduction in parameters and a 2.7x speedup. It runs in real time on the CPU at 22 fps. The code is available at https://github.com/LiYunfengLYF/LightFC-X.

Paper Structure

This paper contains 17 sections, 16 equations, 6 figures, 10 tables.

Figures (6)

  • Figure 1: Comparisons with state-of-the-art lightweight RGB-T trackers in terms of success rate, speed, and parameters on the LasHeR benchmark. The proposed LightFC-T is superior to TBSI-Tiny, CMD, and MANet++, while using far fewer parameters.
  • Figure 2: The overall framework of LightFC-X. It first obtains a spatiotemporal template feature of the target by aggregating the template and the dynamic template features. It then uses the TSAIM module to obtain two single-modality features. Our proposed ECAM module achieves lightweight interaction of template-search area integrated features. The cross-modal integrated feature is concatenated with the search area features from the two modalities and subsequently fed into the prediction head. During training, we first train the model that includes the ECAM module but excludes the STAM module. Afterward, we freeze the model and finetune the STAM module.
  • Figure 3: Illustration of our proposed ECAM module. It contains a lightweight cross-attention layer and a joint feature encoding module.
  • Figure 4: Illustration of our proposed STAM module. It contains a lightweight cross-attention layer, two feature refinement modules, and a linear transformation layer.
  • Figure 5: Different variants of LightFC-X. "TSAIM" denotes the template-search area interaction module. (a) Integrating multimodal features in the Backbone. (b-d) Integrating multimodal features before TSAIM module. (e-f) Integrating multimodal features after TSAIM module.
  • ...and 1 more figure
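
The two-phase training mentioned in the Figure 2 caption (train the model without STAM, then freeze it and fine-tune only STAM) can be illustrated with a deliberately tiny sketch. This is not the LightFC-X code: `ToyTracker`, `train`, the scalar "features", and the toy data are all hypothetical stand-ins, chosen only to show how a frozen base and a separately fine-tuned module interact.

```python
# Toy sketch of two-phase "module fine-tuning". NOT the LightFC-X implementation:
# ToyTracker, train, and the scalar "features" are hypothetical stand-ins.


class ToyTracker:
    def __init__(self):
        # base_w stands in for everything trained in phase 1;
        # stam_w stands in for the template aggregation module added in phase 2.
        self.params = {"base_w": 0.0, "stam_w": 0.0}
        self.use_stam = False

    def aggregate(self, template, dynamic):
        # Without the aggregation module, only the initial template is used.
        if not self.use_stam:
            return template
        # With it, the dynamic template is mixed in by a learned weight.
        return template + self.params["stam_w"] * (dynamic - template)

    def forward(self, template, dynamic):
        return self.params["base_w"] * self.aggregate(template, dynamic)


def train(model, data, trainable, lr=0.05, epochs=200):
    """Plain SGD on squared error; only the names in `trainable` are updated."""
    for _ in range(epochs):
        for template, dynamic, target in data:
            feat = model.aggregate(template, dynamic)
            err = model.forward(template, dynamic) - target
            grads = {
                "base_w": 2 * err * feat,
                "stam_w": (
                    2 * err * model.params["base_w"] * (dynamic - template)
                    if model.use_stam
                    else 0.0
                ),
            }
            for name in trainable:
                model.params[name] -= lr * grads[name]


model = ToyTracker()

# Phase 1: train the base model with the aggregation module excluded.
phase1_data = [(1.0, 3.0, 2.0), (2.0, 1.0, 4.0)]  # (template, dynamic, target)
train(model, phase1_data, trainable={"base_w"})
base_after_phase1 = model.params["base_w"]

# Phase 2: enable the module, keep the base frozen, fine-tune the module only.
model.use_stam = True
phase2_data = [(1.0, 3.0, 4.0), (2.0, 1.0, 3.0)]
train(model, phase2_data, trainable={"stam_w"})
```

Because the second call to `train` only touches `stam_w`, the phase-1 weights are identical before and after fine-tuning. In a deep learning framework the same effect would typically be obtained by disabling gradients for the frozen parameters and passing only the new module's parameters to the optimizer.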