UASTrack: A Unified Adaptive Selection Framework with Modality-Customization in Single Object Tracking

He Wang, Tianyang Xu, Zhangyong Tang, Xiao-Jun Wu, Josef Kittler

TL;DR

UASTrack tackles the lack of modality-adaptive perception in single-object tracking by introducing a lightweight Discriminative Auto-Selector (DAS) to identify the input modality and a Task-Customized Optimization Adapter (TCOA) to tailor the network for RGB-T, RGB-D, and RGB-E tasks within a single model and parameter set. The framework freezes an RGB-based Transformer backbone while learning modality-specific adapters and a modality-aware selection mechanism, enabling robust cross-modal fusion without modality priors. Key contributions include the DAS with Classification Constraint Loss and the modality-specific MASA/VA adapters, which together enable adaptive processing and noise reduction per modality. Empirical results on five benchmarks (LasHeR, GTOT, RGBT234, VisEvent, DepthTrack) demonstrate competitive or state-of-the-art performance with only 1.87M additional parameters and 1.95G FLOPs, highlighting strong efficiency and practical impact for real-world multi-modal tracking scenarios.

Abstract

Multi-modal tracking is essential in single-object tracking (SOT), as different sensor types contribute unique capabilities to overcome challenges caused by variations in object appearance. However, existing unified RGB-X trackers (where X denotes the depth, event, or thermal modality) either rely on a task-specific training strategy for individual RGB-X image pairs or fail to address the critical importance of modality-adaptive perception in real-world applications. In this work, we propose UASTrack, a unified adaptive selection framework that facilitates both model and parameter unification, as well as adaptive modality discrimination across various multi-modal tracking tasks. To achieve modality-adaptive perception in joint RGB-X pairs, we design a Discriminative Auto-Selector (DAS) capable of identifying modality labels, thereby distinguishing the data distributions of auxiliary modalities. Furthermore, we propose a Task-Customized Optimization Adapter (TCOA) tailored to the individual modalities in the latent space. This strategy filters redundant noise and mitigates background interference based on the specific characteristics of each modality. Extensive comparisons on five benchmarks (LasHeR, GTOT, RGBT234, VisEvent, and DepthTrack), covering RGB-T, RGB-E, and RGB-D tracking scenarios, demonstrate that our approach achieves competitive performance while introducing only 1.87M additional trainable parameters and 1.95G FLOPs. The code will be available at https://github.com/wanghe/UASTrack.
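
The select-then-adapt idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the softmax selector, the use of cross-entropy as a stand-in for the Classification Constraint Loss, and the three placeholder adapter transforms are all assumptions made for illustration, since the abstract does not specify them.

```python
import math

# The three auxiliary modalities named in the abstract (the "X" in RGB-X).
MODALITIES = ("thermal", "depth", "event")


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_modality(logits):
    """Hypothetical selector: return the most probable modality label."""
    probs = softmax(logits)
    idx = max(range(len(probs)), key=probs.__getitem__)
    return MODALITIES[idx], probs


def classification_loss(probs, true_label):
    """Cross-entropy, ASSUMED here as a placeholder for the paper's loss."""
    return -math.log(probs[MODALITIES.index(true_label)])


# Placeholder adapters: arbitrary per-modality transforms, NOT the paper's
# TCOA design. They only show that each modality gets its own processing.
ADAPTERS = {
    "thermal": lambda feats: [0.5 * f for f in feats],
    "depth": lambda feats: [max(f, 0.0) for f in feats],
    "event": lambda feats: [f if abs(f) > 0.1 else 0.0 for f in feats],
}


def route(aux_features, logits):
    """Pick a modality from the logits, then apply that modality's adapter."""
    label, _ = select_modality(logits)
    return label, ADAPTERS[label](aux_features)


label, adapted = route([0.05, -1.0, 2.0], [0.1, 0.2, 3.0])
print(label, adapted)
```

The point of the sketch is only the control flow: no modality label is supplied by the caller, so a single set of parameters can serve all three RGB-X tasks, with the selector deciding which adapter path is used.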

Paper Structure

This paper contains 15 sections, 8 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: A comparison between our unified tracker and previous modality-specific trackers. (a) N tasks with N models. (b) N tasks with one model but N sets of training parameters. (c) Our proposed method, UASTrack. UASTrack is a unified multi-modal tracker utilizing both a single model architecture and a single set of trainable parameters to dynamically accommodate any modality within the RGB-X sensory input. UASTrack captures distinct modality inputs and applies modality-specific processing tailored to their unique characteristics, marking the first achievement of this capability in an RGB-X tracker. The metric "PSR" (Prediction Success Rate) quantifies the tracker's capability to dynamically adjust to modality variations while maintaining robust recognition performance.
  • Figure 2: Illustration of our proposed UASTrack.
  • Figure 3: Illustration of the proposed Task-Customized Optimization Adapter.
  • Figure 4: The Success Rate (SR) and Precision Rate (PR) of 19 different attributes on the LasHeR dataset.
  • Figure 5: Ablation study with visualized score map comparisons of our proposed method. "w/o TCOA" denotes UASTrack without the TCOA module; "TCOA with linear" denotes a TCOA module that employs only linear layers; and "TCOA with AvgPool+MaxPool" denotes a TCOA module that integrates both average pooling and max pooling operations.
  • ...and 2 more figures