SPMTrack: Spatio-Temporal Parameter-Efficient Fine-Tuning with Mixture of Experts for Scalable Visual Tracking

Wenrui Cai, Qingjie Liu, Yunhong Wang

TL;DR

The paper addresses the difficulty that single-stream Transformer trackers have in modeling diverse patch relations and foreground–background interactions. It introduces SPMTrack, which embeds a tracking-specific Mixture of Experts module (TMoE) in both the attention and FFN layers to combine multiple experts flexibly, and extends relation modeling to multi-frame spatio-temporal context for improved accuracy. TMoE also serves as a parameter-efficient fine-tuning method: a shared expert is frozen while a small set of routed experts and a router are trained, which reduces trainable parameters while preserving or improving performance. Results on seven datasets show state-of-the-art performance with ViT backbones and few trainable parameters, indicating that the method scales and generalizes well for visual tracking.

Abstract

Most state-of-the-art trackers adopt the one-stream paradigm, using a single Vision Transformer for joint feature extraction and relation modeling of template and search region images. However, relation modeling between different image patches exhibits significant variations. For instance, background regions dominated by target-irrelevant information require reduced attention allocation, while foreground regions, particularly boundary areas, need to be emphasized. A single model may not effectively handle all kinds of relation modeling simultaneously. In this paper, we propose a novel tracker called SPMTrack, based on a mixture-of-experts tailored for the visual tracking task (TMoE), which combines the capability of multiple experts to handle diverse relation modeling more flexibly. Benefiting from TMoE, we extend relation modeling from image pairs to spatio-temporal context, further improving tracking accuracy with a minimal increase in model parameters. Moreover, we employ TMoE as a parameter-efficient fine-tuning method, substantially reducing trainable parameters, which enables us to train SPMTrack at varying scales efficiently and to preserve the generalization ability of pretrained models to achieve superior performance. We conduct experiments on seven datasets, and the results demonstrate that our method significantly outperforms current state-of-the-art trackers. The source code is available at https://github.com/WenRuiCai/SPMTrack.
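To make the parameter-efficient idea concrete, the following is a minimal sketch of a linear layer whose frozen shared weights are combined with a few trainable routed experts and a router. It is not the authors' TMoE implementation. The class name `ToyMoELinear`, the low-rank form of the experts, the dense softmax gating, and the zero initialization of the up-projections are all assumptions made for illustration.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; stands in for learned or pretrained weights."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]


class ToyMoELinear:
    """Hypothetical layer: frozen shared map plus trainable low-rank routed experts."""

    def __init__(self, dim, rank, num_experts, seed=0):
        rng = random.Random(seed)
        # Frozen shared expert (assumption: plays the role of pretrained weights).
        self.shared = rand_matrix(dim, dim, rng)
        # Trainable router: one logit per routed expert.
        self.router = rand_matrix(num_experts, dim, rng)
        # Trainable routed experts, each a low-rank pair (assumption, LoRA-like).
        self.down = [rand_matrix(rank, dim, rng) for _ in range(num_experts)]
        # Zero-initialized up-projections so the layer starts as the shared map.
        self.up = [[[0.0] * rank for _ in range(dim)] for _ in range(num_experts)]

    def forward(self, x):
        gates = softmax(matvec(self.router, x))  # dense gating (assumption)
        out = matvec(self.shared, x)
        for gate, down, up in zip(gates, self.down, self.up):
            delta = matvec(up, matvec(down, x))
            out = [o + gate * d for o, d in zip(out, delta)]
        return out

    def param_counts(self):
        """Return (frozen, trainable) parameter counts."""
        dim = len(self.shared)
        num_experts = len(self.down)
        rank = len(self.down[0])
        frozen = dim * dim
        trainable = num_experts * dim + num_experts * 2 * rank * dim
        return frozen, trainable
```

For example, `ToyMoELinear(dim=64, rank=2, num_experts=4).param_counts()` gives 4096 frozen and 1280 trainable parameters. Only the router and the low-rank experts would be updated during fine-tuning, so the trainable share shrinks further as `dim` grows.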

Paper Structure

This paper contains 15 sections, 5 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Comparison of LaSOT AUC and model parameter count across different trackers. Larger loop indicates better performance.
  • Figure 2: Overview of SPMTrack, which consists of a feature extraction network and a prediction head. The main body of the feature extraction network is a Transformer encoder composed of multiple TMoEBlocks. The structure of a TMoEBlock is shown on the right side.
  • Figure 3: The structure of TMoE. The symbols have the same meaning as in Figure 2.
  • Figure 4: The performance of our method compared with other state-of-the-art trackers in terms of AUC across various scenarios in the LaSOT test split.
  • Figure 5: Comparison of t-SNE visualizations. Each column shows outputs from all compression experts (top) and routed experts (bottom) within a TMoE module. Different colors represent distinct experts. Zoom in for a better view.
  • ...and 1 more figures