Unifying Visual and Vision-Language Tracking via Contrastive Learning

Yinchao Ma, Yuyang Tang, Wenfei Yang, Tianzhu Zhang, Jinpeng Zhang, Mengxue Kang

TL;DR

UVLTrack addresses the challenge of unifying visual and vision-language tracking across BBOX, NL, and NL+BBOX by introducing a modality-unified feature extractor and a modality-adaptive box head. It aligns vision and language into a common semantic space using a multi-modal contrastive loss and leverages a distribution-based cross-attention mechanism in a dynamic head to exploit context from video histories, formalized by the training objective $L = L_{tgt} + L_{cls} + L_{box} + \lambda_{mmc} \sum_{i=1}^{N+M} L_{mmc}^{i}$. Empirically, UVLTrack achieves strong performance on seven visual tracking benchmarks, three vision-language datasets, and three visual grounding datasets, with UVLTrack-L attaining state-of-the-art results and UVLTrack-B offering higher speed. This framework enables flexible target references and demonstrates practical impact for multi-modal tracking and grounding tasks, with code to be released.
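
The training objective above is a weighted sum, which can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `total_loss` and its argument names are hypothetical, the individual loss values are assumed to be computed elsewhere, and this summary states neither the value of $\lambda_{mmc}$ nor what the index $i = 1, \dots, N+M$ ranges over, so both are left to the caller.

```python
from typing import Sequence


def total_loss(
    l_tgt: float,
    l_cls: float,
    l_box: float,
    l_mmc_terms: Sequence[float],
    lambda_mmc: float,
) -> float:
    """Combine the loss terms as L = L_tgt + L_cls + L_box + lambda_mmc * sum_i L_mmc^i.

    l_mmc_terms holds the N+M multi-modal contrastive terms (hypothetical
    container; how each term is computed is not covered here).
    lambda_mmc is the contrastive weight; its value is not given in this summary.
    """
    return l_tgt + l_cls + l_box + lambda_mmc * sum(l_mmc_terms)


# Usage with made-up numbers, only to show how the terms combine.
example = total_loss(1.0, 2.0, 3.0, [1.0, 1.0], lambda_mmc=0.5)
```

The same expression works unchanged if the terms are framework tensors rather than floats, since it uses only addition and scalar multiplication.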

Abstract

Single object tracking aims to locate the target object in a video sequence according to the state specified by different modal references, including the initial bounding box (BBOX), natural language (NL), or both (NL+BBOX). Due to the gap between different modalities, most existing trackers are designed for only one or some of these reference settings and overspecialize on a specific modality. In contrast, we present a unified tracker called UVLTrack, which can handle all three reference settings (BBOX, NL, NL+BBOX) with the same parameters. The proposed UVLTrack has several merits. First, we design a modality-unified feature extractor for joint visual and language feature learning and propose a multi-modal contrastive loss to align the visual and language features into a unified semantic space. Second, we propose a modality-adaptive box head, which makes full use of the target reference to dynamically mine ever-changing scenario features from video contexts and distinguish the target in a contrastive way, enabling robust performance in different reference settings. Extensive experimental results demonstrate that UVLTrack achieves promising performance on seven visual tracking datasets, three vision-language tracking datasets, and three visual grounding datasets. Code and models will be open-sourced at https://github.com/OpenSpaceAI/UVLTrack.
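
To make the three reference settings concrete, the sketch below models a target reference as an optional bounding box plus an optional description and reports which setting it corresponds to. It is only an illustration of the BBOX / NL / NL+BBOX distinction: the `TargetReference` class, its field names, and the `(x, y, w, h)` box convention are assumptions, not part of UVLTrack's released code.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical box convention (x, y, width, height); the paper's format may differ.
Box = Tuple[float, float, float, float]


@dataclass(frozen=True)
class TargetReference:
    """A target reference given by a box, a description, or both."""

    box: Optional[Box] = None
    text: Optional[str] = None

    def setting(self) -> str:
        """Return the reference setting name: 'BBOX', 'NL', or 'NL+BBOX'."""
        has_box = self.box is not None
        has_text = bool(self.text)  # an empty description counts as missing
        if has_box and has_text:
            return "NL+BBOX"
        if has_box:
            return "BBOX"
        if has_text:
            return "NL"
        raise ValueError("a target reference needs a box, a description, or both")


# Usage: the same structure covers all three settings.
visual = TargetReference(box=(10.0, 20.0, 50.0, 80.0))
grounding = TargetReference(text="the red car on the left")
both = TargetReference(box=(10.0, 20.0, 50.0, 80.0), text="the red car on the left")
```

A unified tracker in the sense described above would accept any of these three references with one set of parameters, rather than requiring a separate model per setting.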

Paper Structure

This paper contains 15 sections, 10 equations, 6 figures, and 9 tables.

Figures (6)

  • Figure 1: Comparison between previous solutions and UVLTrack. A BBOX, NL, or NL+BBOX tracker is one designed to use the bounding box, natural language, or both as the target reference, respectively. Our UVLTrack can handle all three reference settings simultaneously.
  • Figure 2: A unified tracking framework for different target references. NA means "not available". Natural language is not available for the visual tracking task, and the template is not available for the grounding task. Unlike previous trackers designed for specific reference modalities, our UVLTrack can handle all target reference settings (BBOX, NL, NL+BBOX) simultaneously.
  • Figure 3: The attention mask of task-oriented multi-head attention for different target references.
  • Figure 4: The diagram of the multi-modal contrastive loss.
  • Figure 5: (a) shows the out-box similarity statistics. (b) shows the structure of the distribution-based cross-attention. (c) shows the schematic of the modality-adaptive box head, which can make full use of reference information to discriminate the target.
  • ...and 1 more figure