Unifying Visual and Vision-Language Tracking via Contrastive Learning
Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang, Mengxue Kang
TL;DR
UVLTrack unifies visual and vision-language tracking across three reference settings (BBOX, NL, and NL+BBOX) by introducing a modality-unified feature extractor and a modality-adaptive box head. It aligns vision and language features in a common semantic space using a multi-modal contrastive loss, and its dynamic head uses a distribution-based cross-attention mechanism to exploit context from video histories. The training objective is $L = L_{tgt} + L_{cls} + L_{box} + \lambda_{mmc} \sum_{i=1}^{N+M} L_{mmc}^{i}$. Empirically, UVLTrack achieves strong performance on seven visual tracking benchmarks, three vision-language tracking datasets, and three visual grounding datasets, with UVLTrack-L attaining state-of-the-art results and UVLTrack-B offering higher speed. The framework supports flexible target references for multi-modal tracking and grounding tasks, and the code is to be released.
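To make the structure of this objective concrete, the following is a minimal sketch of how the terms might be combined once each has been computed. The function name `total_loss`, its argument names, and the default weight are hypothetical and chosen only for illustration; the summary does not specify how the individual terms are computed or what value $\lambda_{mmc}$ takes.

```python
def total_loss(l_tgt, l_cls, l_box, l_mmc_per_layer, lambda_mmc=0.1):
    """Combine scalar loss terms as L = L_tgt + L_cls + L_box + lambda_mmc * sum_i L_mmc^i.

    l_mmc_per_layer: one contrastive loss value per index i = 1..N+M.
    lambda_mmc: weighting factor; the default here is an assumed placeholder,
    not a value taken from the paper.
    """
    # Sum the per-index contrastive terms, then scale by the shared weight.
    contrastive_sum = sum(l_mmc_per_layer)
    return l_tgt + l_cls + l_box + lambda_mmc * contrastive_sum


if __name__ == "__main__":
    # Toy values only; they do not come from any experiment.
    print(total_loss(1.0, 2.0, 3.0, [0.5, 0.5], lambda_mmc=0.1))
```

The contrastive terms share a single weight, so the weighted sum grows with the number of terms; that is one reason a small $\lambda_{mmc}$ would be a natural choice, although the value used by the authors is not given here.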
Abstract
Single object tracking aims to locate the target object in a video sequence according to the state specified by different modal references, including the initial bounding box (BBOX), natural language (NL), or both (NL+BBOX). Due to the gap between modalities, most existing trackers are designed for only one or a subset of these reference settings and overspecialize to a specific modality. In contrast, we present a unified tracker called UVLTrack, which can handle all three reference settings (BBOX, NL, NL+BBOX) with the same parameters. The proposed UVLTrack has several merits. First, we design a modality-unified feature extractor for joint visual and language feature learning and propose a multi-modal contrastive loss to align the visual and language features into a unified semantic space. Second, we propose a modality-adaptive box head, which makes full use of the target reference to dynamically mine ever-changing scenario features from video contexts and distinguish the target in a contrastive way, enabling robust performance in different reference settings. Extensive experimental results demonstrate that UVLTrack achieves promising performance on seven visual tracking datasets, three vision-language tracking datasets, and three visual grounding datasets. Code and models will be open-sourced at https://github.com/OpenSpaceAI/UVLTrack.
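The idea of aligning a target reference with visual features in a shared space can be illustrated with a generic InfoNCE-style contrastive loss. The sketch below is not the multi-modal contrastive loss defined by the authors, whose exact form is not given in this abstract; it only shows the general technique of pulling a reference embedding toward target tokens and pushing it away from background tokens. The names `cosine`, `contrastive_loss`, `is_target`, and the temperature value are assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def contrastive_loss(reference, tokens, is_target, temperature=0.1):
    """Generic InfoNCE-style loss (illustrative, not the paper's L_mmc).

    reference: embedding of the target reference (from language or a box template).
    tokens: visual token embeddings from the search region.
    is_target: booleans marking which tokens lie on the target.
    """
    if not any(is_target):
        raise ValueError("at least one token must be marked as target")
    logits = [cosine(reference, t) / temperature for t in tokens]
    # Subtract the max logit for numerical stability before exponentiating.
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    positive_mass = sum(e for e, flag in zip(exps, is_target) if flag)
    # Negative log of the softmax probability mass assigned to target tokens.
    return -math.log(positive_mass / sum(exps))


if __name__ == "__main__":
    ref = [1.0, 0.0]
    toks = [[1.0, 0.0], [0.0, 1.0]]
    # The loss is low when the reference matches the target token,
    # and high when the target token points elsewhere.
    print(contrastive_loss(ref, toks, [True, False]))
    print(contrastive_loss(ref, toks, [False, True]))
```

Because the same loss can be applied whether the reference embedding comes from a sentence or from a box template, a formulation of this kind is one plausible way for a single set of parameters to serve all three reference settings.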
