SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking

Wenrui Cai; Qingjie Liu; Yunhong Wang

TL;DR

The paper addresses the problem that one-stream Transformer trackers struggle to model the diverse relations between image patches, including foreground–background interactions, with a single model. It introduces SPMTrack, which embeds a Mixture of Experts module tailored for tracking (TMoE) in both the attention and FFN layers so that multiple experts can be combined flexibly, and extends relation modeling to multi-frame spatio-temporal context for improved accuracy. TMoE also serves as a parameter-efficient fine-tuning method: the shared expert is frozen, and only a small set of routed experts and a router are trained, which reduces trainable parameters while preserving or enhancing performance. Experiments on seven datasets show state-of-the-art performance with ViT backbones and few trainable parameters, highlighting the method’s scalability and generalization potential for visual tracking.

Abstract

Most state-of-the-art trackers adopt the one-stream paradigm, using a single Vision Transformer for joint feature extraction and relation modeling of template and search region images. However, relation modeling between different image patches exhibits significant variations. For instance, background regions dominated by target-irrelevant information require reduced attention allocation, while foreground, particularly boundary areas, need to be emphasized. A single model may not effectively handle all kinds of relation modeling simultaneously. In this paper, we propose a novel tracker called SPMTrack based on mixture-of-experts tailored for the visual tracking task (TMoE), combining the capability of multiple experts to handle diverse relation modeling more flexibly. Benefiting from TMoE, we extend relation modeling from image pairs to spatio-temporal context, further improving tracking accuracy with minimal increase in model parameters. Moreover, we employ TMoE as a parameter-efficient fine-tuning method, substantially reducing trainable parameters, which enables us to train SPMTrack of varying scales efficiently and preserve the generalization ability of pretrained models to achieve superior performance. We conduct experiments on seven datasets, and experimental results demonstrate that our method significantly outperforms current state-of-the-art trackers. The source code is available at https://github.com/WenRuiCai/SPMTrack.
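
The parameter-efficient setup described above (a frozen shared expert combined with a small set of trainable routed experts and a router) can be illustrated with a rough sketch. This is not the authors' implementation. The class name `MoELinearSketch`, the low-rank form of the routed experts, the dense softmax gating, and the zero initialization of the up-projections are all assumptions made for illustration; the actual design is in the linked repository.

```python
import math
import random


def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class MoELinearSketch:
    """Hypothetical linear layer: frozen shared expert + trainable routed experts."""

    def __init__(self, d_in, d_out, n_experts=4, rank=2, seed=0):
        rng = random.Random(seed)
        # Frozen shared expert: stands in for a pretrained dense weight.
        self.shared = rand_matrix(d_out, d_in, rng)
        # Trainable routed experts. ASSUMPTION: each is a low-rank pair
        # (down: rank x d_in, up: d_out x rank), which keeps them small.
        self.down = [rand_matrix(rank, d_in, rng) for _ in range(n_experts)]
        # ASSUMPTION: up-projections start at zero, so the layer initially
        # reproduces the frozen shared expert exactly.
        self.up = [[[0.0] * rank for _ in range(d_out)] for _ in range(n_experts)]
        # Trainable router: one score per routed expert.
        self.router = rand_matrix(n_experts, d_in, rng)

    def forward(self, x):
        # ASSUMPTION: dense soft gating over all routed experts.
        gates = softmax(matvec(self.router, x))
        out = matvec(self.shared, x)
        for g, down, up in zip(gates, self.down, self.up):
            delta = matvec(up, matvec(down, x))
            out = [o + g * d for o, d in zip(out, delta)]
        return out

    def param_counts(self):
        """Return (frozen, trainable) parameter counts."""
        size = lambda m: len(m) * len(m[0])
        frozen = size(self.shared)
        trainable = size(self.router) + sum(
            size(d) + size(u) for d, u in zip(self.down, self.up)
        )
        return frozen, trainable


layer = MoELinearSketch(d_in=64, d_out=64, n_experts=4, rank=2)
print(layer.param_counts())  # (4096, 1280)
```

In this toy configuration the frozen shared expert holds 4096 weights, while the router and four rank-2 experts together hold 1280. The gap widens as the layer width grows, because the shared expert scales with `d_in * d_out` and the routed experts scale only with `rank * (d_in + d_out)`.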
