Single-Model and Any-Modality for Video Object Tracking

Zongwei Wu; Jilai Zheng; Xiangxuan Ren; Florin-Alexandru Vasluianu; Chao Ma; Danda Pani Paudel; Luc Van Gool; Radu Timofte

TL;DR

This paper tackles video object tracking across heterogeneous modalities by proposing Un-Track, a unified tracker that uses a single parameter set to handle any RGB-X input. The core idea is to learn a shared embedding across modalities through low-rank factorization guided by explicit edge priors, complemented by a cross-modal prompting mechanism and LoRA-based finetuning of a pretrained RGB tracker. Because the shared embedding is learned from RGB-X pairs only, the modalities do not need to co-occur during training. Un-Track surpasses both modality-specific and prior unified trackers across five benchmarks at modest cost: on DepthTrack it gains +8.1 absolute F-score while adding +2.14 GFLOPs and +6.6M parameters. The resulting model remains effective when auxiliary modalities are missing or vary across deployments, and the code is publicly released.
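To make the LoRA-based finetuning mentioned above concrete, here is a minimal sketch of the generic LoRA idea: a frozen weight matrix plus a trainable low-rank update. It is not the authors' implementation; the class name `LoRALinear`, the rank, the `alpha` scaling, and the initialization scale are assumptions chosen for illustration, and this page does not say which layers of the tracker are adapted.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W (d_out x d_in) plus a trainable low-rank update B @ A.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # pretrained weight, kept frozen
        # A gets a small random init, B starts at zero, so the update is zero at first.
        self.A = [[rng.gauss(0.0, 0.02) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.B, self.A)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to a vector x of length d_in."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count only the low-rank factors; the frozen weight is excluded."""
        return len(self.A) * len(self.A[0]) + len(self.B) * len(self.B[0])


# A 4 x 6 frozen weight adapted with rank 2.
frozen = [[1.0 if i == j else 0.0 for j in range(6)] for i in range(4)]
layer = LoRALinear(frozen, rank=2)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = layer.forward(x)
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer exactly, and only `rank * (d_in + d_out)` values are trained instead of `d_in * d_out`.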

Abstract

In the realm of video object tracking, auxiliary modalities such as depth, thermal, or event data have emerged as valuable assets to complement the RGB trackers. In practice, most existing RGB trackers learn a single set of parameters to use them across datasets and applications. However, a similar single-model unification for multi-modality tracking presents several challenges. These challenges stem from the inherent heterogeneity of inputs -- each with modality-specific representations, the scarcity of multi-modal datasets, and the absence of all the modalities at all times. In this work, we introduce Un-Track, a Unified Tracker of a single set of parameters for any modality. To handle any modality, our method learns their common latent space through low-rank factorization and reconstruction techniques. More importantly, we use only the RGB-X pairs to learn the common latent space. This unique shared representation seamlessly binds all modalities together, enabling effective unification and accommodating any missing modality, all within a single transformer-based architecture. Our Un-Track achieves +8.1 absolute F-score gain, on the DepthTrack dataset, by introducing only +2.14 (over 21.50) GFLOPs with +6.6M (over 93M) parameters, through a simple yet efficient prompting strategy. Extensive comparisons on five benchmark datasets with different modalities show that Un-Track surpasses both SOTA unified trackers and modality-specific counterparts, validating our effectiveness and practicality. The source code is publicly available at https://github.com/Zongwei97/UnTrack.
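The overhead figures quoted in the abstract can be turned into relative terms with a short calculation. The sketch below assumes that "over" denotes the baseline cost (21.50 GFLOPs and 93M parameters) to which the additions are applied; the variable and function names are illustrative only.

```python
# Figures quoted in the abstract; the baseline interpretation is an assumption.
base_gflops, extra_gflops = 21.50, 2.14
base_params_m, extra_params_m = 93.0, 6.6


def relative_overhead(extra, base):
    """Return the added cost as a percentage of the baseline."""
    return 100.0 * extra / base


gflops_pct = relative_overhead(extra_gflops, base_gflops)
params_pct = relative_overhead(extra_params_m, base_params_m)
print(f"GFLOPs overhead: {gflops_pct:.2f}%")
print(f"Parameter overhead: {params_pct:.2f}%")
```

Under that reading, the prompting strategy adds roughly 10% compute and roughly 7% parameters on top of the baseline.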

Paper Structure (13 sections, 8 equations, 6 figures, 9 tables, 1 algorithm)

Introduction
Related Works
Methods
Overall Framework
Shared Embedding
Outer Modal Prompting
Inner Finetuning
Experiments
Training Data
Within distribution Evaluation
Generalization Across Datasets
Ablation Studies
Conclusion and Future Work

Figures (6)

Figure 1: Un-Track is a unified tracker with a single parameter set that seamlessly integrates any modality (of RGB-X).
Figure 2: Our proposed framework, termed Un-Track, is composed of a shared embedding, a modal prompting block, and a LoRA-finetuned pretrained RGB tracker. The shared embedding learns a joint representation that unifies all modalities (see Shared Embedding). The modal prompting block enhances feature modeling with modal awareness at each scale (see Outer Modal Prompting). To track the target object, we finetune a pretrained foundation model, OSTrack, using the LoRA technique (see Inner Finetuning). Our model achieves a unified model applicable across different modalities under a single parameter set. During inference, Un-Track seamlessly integrates any image-paired data, thanks to the emergent alignment.
Figure 3: Shared Embedding. We derive a joint representation through low-rank factorization and reconstruction. Such an implicit learning is additionally integrated with explicit edge awareness to enhance the embedding.
Figure 4: Modal Prompting. For the visual feature $I$, we employ a score function to categorize tokens into negative, uncertain, and positive segments. Using a token exchange policy, we discard negative tokens, enhance uncertain ones with corresponding tokens from $F$, and retain positive ones. Then, we transform the feature fusion task into a token recovery problem, addressed by low-rank factorization. Similarly, we extract the most informative low-rank matrix from $F$ to fuse and reconstruct the shared output.
Figure 5: More precision/success comparisons on the VisEvent dataset. "Uni" stands for models with a single parameter set. "_E" stands for the extension of RGB trackers with event fusion.
...and 1 more figures
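The token exchange policy described in the Figure 4 caption can be sketched as follows. This is only an illustration under stated assumptions: the thresholds `low` and `high`, the zeroing of negative tokens, and the use of element-wise addition to "enhance" uncertain tokens are hypothetical choices, not details given on this page. The subsequent low-rank token recovery step is not shown.

```python
def exchange_tokens(visual, shared, scores, low=0.3, high=0.7):
    """Apply a three-way token exchange between visual tokens I and shared tokens F.

    visual, shared: lists of token vectors with matching shapes.
    scores: one score per token, assumed to lie in [0, 1].
    low, high: hypothetical thresholds separating the three segments.
    """
    out, labels = [], []
    for tok_i, tok_f, score in zip(visual, shared, scores):
        if score < low:
            # Negative token: discarded (zeroed here), to be recovered later.
            out.append([0.0] * len(tok_i))
            labels.append("negative")
        elif score < high:
            # Uncertain token: enhanced with the corresponding token from F.
            out.append([a + b for a, b in zip(tok_i, tok_f)])
            labels.append("uncertain")
        else:
            # Positive token: retained unchanged.
            out.append(list(tok_i))
            labels.append("positive")
    return out, labels


visual = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
shared = [[10.0, 10.0], [20.0, 20.0], [30.0, 30.0]]
tokens, labels = exchange_tokens(visual, shared, scores=[0.1, 0.5, 0.9])
```

In this toy input the first token is dropped, the second is combined with its counterpart from the shared embedding, and the third passes through untouched.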
