SUTrack: Towards Simple and Unified Single Object Tracking

Xin Chen; Ben Kang; Wanting Geng; Jiawen Zhu; Yi Liu; Dong Wang; Huchuan Lu

TL;DR

SUTrack tackles fragmentation in single object tracking (SOT) by unifying RGB-based and multi-modal SOT tasks into a single model trained in one session. It introduces a unified modality representation via multi-modal patch embedding, a soft token type embedding, and a task-recognition auxiliary training objective, and uses a CLIP-based language encoder for the RGB-Language task. The method achieves state-of-the-art or competitive results across 11 benchmarks spanning five tasks, with model variants ranging from edge devices to high-performance GPUs that trade off speed and accuracy. The work aims to provide a foundation for future unified tracking research, and code and models are publicly available.

Abstract

In this paper, we propose a simple yet unified single object tracking (SOT) framework, dubbed SUTrack. It consolidates five SOT tasks (RGB-based, RGB-Depth, RGB-Thermal, RGB-Event, RGB-Language Tracking) into a unified model trained in a single session. Due to the distinct nature of the data, current methods typically design individual architectures and train separate models for each task. This fragmentation results in redundant training processes, repetitive technological innovations, and limited cross-modal knowledge sharing. In contrast, SUTrack demonstrates that a single model with a unified input representation can effectively handle various common SOT tasks, eliminating the need for task-specific designs and separate training sessions. Additionally, we introduce a task-recognition auxiliary training strategy and a soft token type embedding to further enhance SUTrack's performance with minimal overhead. Experiments show that SUTrack outperforms previous task-specific counterparts across 11 datasets spanning five SOT tasks. Moreover, we provide a range of models catering to edge devices as well as high-performance GPUs, striking a good trade-off between speed and accuracy. We hope SUTrack can serve as a strong foundation for further compelling research into unified tracking models. Code and models are available at github.com/chenxin-dlut/SUTrack.
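To make the idea of a unified input representation more concrete, the following is a minimal NumPy sketch of one way an RGB frame and an optional auxiliary frame (depth, thermal, or event) could be mapped to the same token format. It is not the authors' implementation. The channel-wise concatenation, the zero-filling of a missing auxiliary modality, the patch size, the embedding width, and the names `patchify` and `unified_patch_embed` are all assumptions made for illustration; the abstract does not specify these details.

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into (N, patch * patch * C) flattened patches."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)


def unified_patch_embed(rgb, aux, weight, bias, patch=16):
    """Embed an RGB frame plus an optional auxiliary frame into one token sequence.

    Assumption (hypothetical): when no auxiliary modality exists, its channels
    are filled with zeros so every task shares the same 6-channel input layout.
    """
    if aux is None:
        aux = np.zeros_like(rgb)
    stacked = np.concatenate([rgb, aux], axis=-1)  # (H, W, 6)
    patches = patchify(stacked, patch)             # (N, patch * patch * 6)
    return patches @ weight + bias                 # (N, dim) linear projection


rng = np.random.default_rng(0)
patch, dim = 16, 32
weight = rng.normal(scale=0.02, size=(patch * patch * 6, dim))
bias = np.zeros(dim)

rgb = rng.random((64, 64, 3))
depth = rng.random((64, 64, 3))  # stand-in for a depth map rendered as 3 channels

tokens_rgb = unified_patch_embed(rgb, None, weight, bias, patch)
tokens_rgbd = unified_patch_embed(rgb, depth, weight, bias, patch)
print(tokens_rgb.shape, tokens_rgbd.shape)
```

Under these assumptions, an RGB-only input and an RGB-Depth input yield token sequences of identical shape, so a single transformer could consume either without a task-specific branch.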

Paper Structure (37 sections, 10 equations, 4 figures, 16 tables)

Introduction
Related Work
RGB-based Object Tracking
Multi-Modal Object Tracking
Unified Object Tracking Models
SUTrack
Unified Modality Representation
Soft Token Type Embedding
Task-recognition Training Strategy
Training and Inference
Experiments
Implementation Details
State-of-the-Art Comparisons
Ablation and Analysis
Conclusion
...and 22 more sections

Figures (4)

Figure 1: Our SUTrack unifies five SOT tasks into one model with one training session.
Figure 2: Architecture of the proposed SUTrack. SUTrack unifies five SOT tasks (RGB-based, RGB-Depth, RGB-Thermal, RGB-Event, RGB-Language Tracking) into a single model. We use a unified token embedding format to represent different modalities and train a transformer-based tracking model with these embeddings. In the figure, D/T/E denote depth, thermal, and event modalities, respectively.
Figure 3: EAO rank plots on VOT2020 and VOT2022.
Figure 4: AUC scores of different attributes on LaSOT.
