UniSOT: A Unified Framework for Multi-Modality Single Object Tracking

Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Xu Zhou, Feng Wu

TL;DR

UniSOT addresses the practical need for a single tracker that can operate with multiple reference modalities (NL, BBOX, NL+BBOX) and multiple video modalities (RGB, RGB+Depth, RGB+Thermal, RGB+Event) using a unified parameter set. It introduces a reference-generalized feature extractor with a multi-modal contrastive loss and a reference-adaptive box head to stabilize localization across references, plus RAMA to jointly learn video-modality-aligned and modality-specific features within AMTBs. The two-stage training paradigm (RGB pretraining followed by RGB+X fine-tuning with modality-shared and modality-specific rank allocations) enables incremental learning of new modalities. Experiments across 18 benchmarks demonstrate superior performance over modality-specific trackers and robust cross-modal generalization, with practical inference speeds and visualizations that support the proposed mechanisms.

Abstract

Single object tracking aims to localize a target object with specific reference modalities (bounding box, natural language, or both) in a sequence of specific video modalities (RGB, RGB+Depth, RGB+Thermal, or RGB+Event). Different reference modalities enable various human-machine interactions, and different video modalities are needed in complex scenarios to enhance tracking robustness. Existing trackers are designed for one or several video modalities with one or several reference modalities, which leads to separate model designs and limits practical applications. In practice, a unified tracker is needed to handle these varied requirements. To the best of our knowledge, there is still no tracker that can perform tracking with all of the above reference modalities across all of these video modalities simultaneously. Thus, in this paper, we present a unified tracker, UniSOT, for different combinations of three reference modalities and four video modalities with uniform parameters. Extensive experimental results on 18 visual tracking, vision-language tracking, and RGB+X tracking benchmarks demonstrate that UniSOT shows superior performance against modality-specific counterparts. Notably, UniSOT outperforms previous counterparts by over 3.0% AUC on TNL2K across all three reference modalities and outperforms Un-Track by over 2.0% on the main metric across all three RGB+X video modalities.

Paper Structure

This paper contains 27 sections, 17 figures, 21 tables.

Figures (17)

  • Figure 1: Comparison between previous solutions and UniSOT. There are various reference and video modalities in single object tracking tasks for different application scenarios. However, previous trackers are commonly tailored for a specific reference or video modality. (a) BBOX, NL, and NL+BBOX trackers utilize the bounding box (BBOX), natural language (NL), or both (NL+BBOX) as the target object reference to track on RGB sequences. (b) RGB, RGB+Depth, RGB+Thermal, and RGB+Event trackers are designed for tracking in sequences of the corresponding video modality with a BBOX reference. (c) Unlike the customized solutions for specific reference or video modalities, we seek to design a unified tracker (UniSOT) that can perform tracking with different combinations of three reference modalities and four video modalities using uniform parameters, enabling generalized capability.
  • Figure 2: A unified tracking framework for different target references. NA means "not available", which is filled with zeros. Natural language is not available for the visual tracking task, and the template is not available for the grounding task. Different from previous trackers designed for specific reference or video modalities, our UniSOT can simultaneously handle different combinations of reference and video modalities in a unified framework.
  • Figure 3: The attention mask of task-oriented multi-head attention for different target references. It enables different reference modalities to be trained jointly.
  • Figure 4: The diagram of the multi-modal contrastive loss. We align different reference modalities with the target object patch feature by contrasting hard background patches.
  • Figure 5: (a) shows the schematic of the reference-adaptive box head, which can make full use of reference information to discriminate the target object. (b) shows the structure of the distribution-based cross-attention. (c) illustrates the descending order of similarity between the semantic token and template/context background patches, which can be regarded as the probability of background features being mistakenly classified as the target object. The cumulative probability distribution clarifies how we apply the threshold $\beta$ to separate distractor features from background features.
  • ...and 12 more figures
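
The thresholding idea in the Figure 5(c) caption can be illustrated with a small sketch. This is only one plausible reading of the caption, not the authors' implementation: it assumes the similarities are turned into probabilities with a softmax, sorted in descending order, and that patches are treated as distractors until the cumulative probability reaches $\beta$. The function name `split_distractors`, the softmax normalization, and the exact cut-off rule are all assumptions made for illustration.

```python
import math


def split_distractors(similarities, beta=0.5):
    """Hypothetical sketch: split background patches into distractors and
    plain background using a cumulative-probability threshold `beta`.

    `similarities` holds one similarity score per background patch
    (semantic token vs. patch feature). Returns two lists of patch indices.
    """
    if not similarities:
        return [], []

    # Assumption: normalize similarities into probabilities with a softmax.
    peak = max(similarities)
    exps = [math.exp(s - peak) for s in similarities]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Sort patch indices by probability, highest first (descending order).
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    distractors, background = [], []
    cumulative = 0.0
    for idx in order:
        # Assumption: a patch is a distractor while the probability mass
        # accumulated *before* it is still below beta.
        if cumulative < beta:
            distractors.append(idx)
        else:
            background.append(idx)
        cumulative += probs[idx]
    return distractors, background


if __name__ == "__main__":
    scores = [2.0, 0.0, 0.0, 0.0]
    print(split_distractors(scores, beta=0.5))
```

Under this reading, a larger $\beta$ marks more background patches as distractors, while a smaller $\beta$ keeps only the patches most similar to the semantic token.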