LightFC-X: Lightweight Convolutional Tracker for RGB-X Tracking
Yunfeng Li, Bo Wang, Ye Li
TL;DR
LightFC-X addresses the high computational cost of multimodal tracking with a unified, lightweight family of convolutional RGB-X trackers built on two modules: the Efficient Cross-Attention Module (ECAM) for cross-modal interaction and the Spatiotemporal Template Aggregation Module (STAM) for temporal refinement, which is trained with a two-phase, module fine-tuning strategy. The approach preserves accuracy while reducing parameters and increasing speed: on the LasHeR benchmark, LightFC-T-ST outperforms CMD with 2.6x fewer parameters and a 2.7x speedup, and it runs in real time on a CPU at 22 fps. The trackers achieve state-of-the-art results on multiple benchmarks (including RGB-T, RGB-D, RGB-E, and RGB-S). The main contributions are the 0.08M-parameter ECAM, the STAM-based temporal aggregation with its module fine-tuning paradigm, and a demonstration of practical deployment potential on edge devices. This work advances lightweight multimodal tracking by enabling efficient cross-modal feature interaction and temporal modeling without sacrificing performance.
Abstract
Despite great progress in multimodal tracking, existing trackers remain too heavy and expensive for resource-constrained devices. To alleviate this problem, we propose LightFC-X, a family of lightweight convolutional RGB-X trackers that explores a unified convolutional architecture for lightweight multimodal tracking. Our core idea is to achieve lightweight cross-modal modeling and joint refinement of the multimodal features and the spatiotemporal appearance features of the target. Specifically, we propose a novel efficient cross-attention module (ECAM) and a novel spatiotemporal template aggregation module (STAM). The ECAM achieves lightweight cross-modal interaction of the integrated template-search area features with only 0.08M parameters. The STAM enhances the model's utilization of temporal information through a module fine-tuning paradigm. Comprehensive experiments show that LightFC-X achieves state-of-the-art performance and an optimal balance between parameters, performance, and speed. For example, LightFC-T-ST outperforms CMD by 4.3% and 5.7% in success rate (SR) and precision rate (PR) on the LasHeR benchmark, while achieving a 2.6x reduction in parameters and a 2.7x speedup. It runs in real time on the CPU at a speed of 22 fps. The code is available at https://github.com/LiYunfengLYF/LightFC-X.
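
To make the idea of cross-modal interaction concrete, here is a minimal sketch of generic scaled dot-product cross-attention, in which RGB feature tokens query tokens from the auxiliary X modality. The abstract does not describe ECAM's internals, so this is not the authors' module: the function names, the toy token values, the identity projection weights, and the residual fusion are all assumptions chosen for illustration.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    """Swap rows and columns of a matrix."""
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_tokens, context_tokens, w_q, w_k, w_v):
    """Generic cross-attention (illustrative, not the paper's ECAM).

    Tokens of one modality (queries) attend to tokens of the other
    modality (context); the result is added back to the query tokens.
    """
    q = matmul(query_tokens, w_q)    # project the querying modality
    k = matmul(context_tokens, w_k)  # project the other modality to keys
    v = matmul(context_tokens, w_v)  # project the other modality to values
    scale = 1.0 / math.sqrt(len(q[0]))
    scores = matmul(q, transpose(k))
    weights = [softmax([s * scale for s in row]) for row in scores]
    attended = matmul(weights, v)
    # Residual fusion (an assumption): keep the original feature and add
    # the information gathered from the other modality.
    fused = [[a + b for a, b in zip(tok, att)]
             for tok, att in zip(query_tokens, attended)]
    return fused, weights


# Hypothetical toy inputs: 3 tokens with 2 channels per modality.
identity = [[1.0, 0.0], [0.0, 1.0]]
rgb_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
aux_tokens = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]

fused, weights = cross_attention(rgb_tokens, aux_tokens, identity, identity, identity)

# Bias-free parameter count of the three projections in this toy setup.
toy_param_count = sum(len(w) * len(w[0]) for w in (identity, identity, identity))
```

In this sketch the parameter cost comes entirely from the projection matrices, so it grows with the square of the channel width and is independent of the number of tokens. That scaling is one reason a cross-attention block can be kept small, although how ECAM actually reaches 0.08M parameters is specified in the paper rather than here.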
