All in One: Exploring Unified Vision-Language Tracking with Multi-Modal Alignment

Chunhui Zhang, Xin Sun, Yiqian Yang, Li Liu, Qiong Liu, Xi Zhou, Yanfeng Wang

TL;DR

All-in-One is a unified vision-language (VL) tracking framework that feeds the visual template, search region, and language prompt into a single transformer backbone. A modal mixup step injects language information into the vision embeddings, enabling bidirectional cross-modal interaction, while a multi-modal alignment (MMA) module, comprising cross-modal alignment (CMA) and intra-modal alignment (IMA), regularizes the representations with InfoNCE-based contrastive losses. Empirically, the approach achieves state-of-the-art results on five VL tracking benchmarks and runs at about 60 FPS, improving both accuracy and efficiency over prior two-stream methods and even some one-stream methods. The work advocates a foundation-model-like direction for VL tracking that reduces fusion complexity and improves generalization to diverse scenes and prompts.

Abstract

The current mainstream vision-language (VL) tracking framework consists of three parts, i.e., a visual feature extractor, a language feature extractor, and a fusion model. To pursue better performance, a natural modus operandi for VL tracking is to employ customized and heavier unimodal encoders and multi-modal fusion models. Albeit effective, existing VL trackers separate feature extraction from feature integration, so the extracted features lack semantic guidance and have limited target-aware capability in complex scenarios, e.g., similar distractors and extreme illumination. In this work, inspired by the recent success of foundation models with a unified architecture for both natural language and computer vision tasks, we propose an All-in-One framework, which learns joint feature extraction and interaction by adopting a unified transformer backbone. Specifically, we mix raw vision and language signals to generate language-injected vision tokens, which we then concatenate before feeding into the unified backbone architecture. This approach achieves feature integration in a unified backbone, removing the need for carefully designed fusion modules and resulting in a more effective and efficient VL tracking framework. To further improve learning efficiency, we introduce a multi-modal alignment module based on cross-modal and intra-modal contrastive objectives, providing more reasonable representations for the unified All-in-One transformer backbone. Extensive experiments on five benchmarks, i.e., OTB99-L, TNL2K, LaSOT, LaSOT$_{\rm Ext}$, and WebUAV-3M, demonstrate the superiority of the proposed tracker over existing state-of-the-art methods on VL tracking. Code will be made publicly available at https://github.com/983632847/All-in-One.
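
The token flow described in the abstract (inject language into the vision tokens, then concatenate them for a single backbone) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the mixup operator, so the additive injection of a mean-pooled language embedding, the function names `inject_language` and `build_backbone_input`, and all token counts and dimensions are hypothetical.

```python
import numpy as np


def inject_language(vision_tokens, language_tokens):
    """Hypothetical modal mixup (an assumption, not the paper's operator):
    add the mean-pooled language embedding to every vision token."""
    pooled = language_tokens.mean(axis=0, keepdims=True)  # shape (1, D)
    return vision_tokens + pooled  # broadcasts over (N, D)


def build_backbone_input(template, search, language):
    """Concatenate language-injected template and search tokens
    into one sequence for a single (unified) backbone."""
    template_mixed = inject_language(template, language)
    search_mixed = inject_language(search, language)
    return np.concatenate([template_mixed, search_mixed], axis=0)


# Illustrative sizes only: 4 template tokens, 16 search tokens, 5 words.
rng = np.random.default_rng(0)
dim = 8
template = rng.normal(size=(4, dim))
search = rng.normal(size=(16, dim))
language = rng.normal(size=(5, dim))

tokens = build_backbone_input(template, search, language)
print(tokens.shape)  # (20, 8)
```

The point of the sketch is structural: once language is mixed into the vision tokens, a single sequence reaches the backbone and no separate fusion module is needed downstream.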

Paper Structure (18 sections, 12 equations, 8 figures, 4 tables)

Figures (8)

  • Figure 1: Existing VL tracking framework vs. our All-in-One. (a) Existing VL tracking methods obtain multiple modality features from separate extractors before fusion. The feature interaction relies on a carefully-designed fusion model. (b) We aim to build a foundation model, i.e., All-in-One, for VL tracking, which achieves joint feature extraction and multi-modal interaction using a versatile transformer encoder.
  • Figure 2: Overview of the proposed All-in-One framework. The multi-modal alignment module is introduced before the All-in-One transformer backbone to align visual and language embeddings in the feature space. The All-in-One transformer backbone is applied to achieve joint feature extraction and interaction. The tracking head is used to predict object location.
  • Figure 3: Illustration of the MMA module, which contains CMA and IMA. With CMA only, the second vision embedding (yellow star) is pulled towards its matched language embedding (green star). By incorporating IMA, the model can learn a more reasonable embedding (yellow square to green square).
  • Figure 4: Visualization revealing the target-aware capability of the All-in-One framework. "AOT" denotes our approach with only the All-in-One transformer; "AOT+MMA" denotes our approach with both the All-in-One transformer and the multi-modal alignment module.
  • Figure 5: Analysis of the effect of ambiguous language prompts on the LaSOT test set. $^{*}$ indicates that our approach is tested with a clear sentence prompt or a clear class prompt.
  • ...and 3 more figures
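
The contrastive alignment sketched in the Figure 3 caption can be illustrated with a generic InfoNCE loss. This is a hedged sketch, not the paper's exact objective: the function name `info_nce`, the temperature of 0.07, the symmetric form of the cross-modal term, the choice of template/search embeddings as the intra-modal pair, and the unweighted sum of the two terms are all assumptions made for illustration.

```python
import numpy as np


def info_nce(anchors, positives, temperature=0.07):
    """Generic InfoNCE: row i of `anchors` is matched with row i of
    `positives`; every other row in the batch acts as a negative."""
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = a @ p.T / temperature                       # (B, B) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


rng = np.random.default_rng(0)
batch, dim = 4, 16
vision = rng.normal(size=(batch, dim))
language = vision + 0.01 * rng.normal(size=(batch, dim))  # matched pairs
search = vision + 0.01 * rng.normal(size=(batch, dim))    # second vision view

# Cross-modal term (assumed symmetric): vision <-> language.
cma = 0.5 * (info_nce(vision, language) + info_nce(language, vision))
# Intra-modal term (assumed pairing): two vision views of the same target.
ima = info_nce(vision, search)
total = cma + ima  # unweighted sum is an assumption

# Mismatched pairs should be penalized more than matched ones.
loss_matched = info_nce(vision, language)
loss_shuffled = info_nce(vision, language[::-1])
print(loss_matched < loss_shuffled)
```

Reversing the language batch breaks every pairing (for a batch of four, no row stays in place), which is why the shuffled loss serves as a sanity check against the matched one.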