COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking

Chunhui Zhang, Li Liu, Jialin Gao, Xin Sun, Hao Wen, Xi Zhou, Shiming Ge, Yanfeng Wang

TL;DR

This work addresses vision-language (VL) tracking of small objects. It proposes COST, a contrastive one-stage transformer that learns unified VL representations through two components: a contrastive alignment that maximizes mutual information between a video and its language description, and a visual-linguistic transformer for cross-modal reasoning. It also introduces VL-SOT500, the first large-scale multi-modal small-object tracking dataset with bounding boxes and language descriptions, which comprises VL-SOT230 for generic small-object tracking and VL-SOT270 for high-speed scenarios. COST achieves state-of-the-art performance on five VL tracking benchmarks and on VL-SOT500. Ablations confirm the value of the linguistic branch, a learnable [OBJ] token, and the CoA module, and visualizations illustrate effective cross-modal alignment. Overall, the work demonstrates that a simple, unified transformer-based fusion can robustly handle small-object VL tracking, and it provides a new dataset to accelerate future research in open-vocabulary, multimodal tracking.

Abstract

Transformers have recently demonstrated great potential in improving vision-language (VL) tracking algorithms. However, most existing VL trackers rely on carefully designed mechanisms to perform multi-stage multi-modal fusion. Additionally, direct multi-modal fusion without alignment ignores the distribution discrepancy between modalities in feature space, potentially leading to suboptimal representations. In this work, we propose COST, a contrastive one-stage transformer fusion framework for VL tracking, aiming to learn semantically consistent and unified VL representations. Specifically, we introduce a contrastive alignment strategy that maximizes mutual information (MI) between a video and its corresponding language description. This enables effective cross-modal alignment, yielding semantically consistent features in the representation space. By leveraging a visual-linguistic transformer, we establish an efficient multi-modal fusion and reasoning mechanism, empirically demonstrating that a simple stack of transformer encoders effectively enables unified VL representations. Moreover, we contribute a newly collected VL tracking benchmark dataset for small object tracking, named VL-SOT500, with bounding boxes and language descriptions. Our dataset comprises two challenging subsets, VL-SOT230 and VL-SOT270, dedicated to evaluating generic and high-speed small object tracking, respectively. Small object tracking is notoriously challenging due to weak appearance and limited features, and this dataset is, to the best of our knowledge, the first to explore the use of language cues to enhance visual representation for small object tracking. Extensive experiments demonstrate that COST achieves state-of-the-art performance on five existing VL tracking datasets, as well as on our proposed VL-SOT500 dataset. The source code and dataset will be made publicly available.
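
The abstract does not spell out the contrastive objective, so the following is only a minimal sketch of one common way to maximize a lower bound on MI between paired video and language features: a symmetric InfoNCE loss, for which MI >= log(N) - loss with N pairs in the batch. The function name `info_nce`, the temperature value, and the use of plain feature vectors are assumptions for illustration, not the authors' implementation.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_softmax_at(row, idx):
    # Numerically stable log-softmax of `row`, evaluated at position `idx`.
    m = max(row)
    log_sum = m + math.log(sum(math.exp(x - m) for x in row))
    return row[idx] - log_sum

def info_nce(video_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE loss; pair i of video_feats matches pair i of text_feats.

    Hypothetical helper: the temperature of 0.07 is an assumed default.
    """
    v = [l2_normalize(x) for x in video_feats]
    t = [l2_normalize(x) for x in text_feats]
    n = len(v)
    # sim[i][j]: scaled cosine similarity between video i and description j.
    sim = [[dot(v[i], t[j]) / temperature for j in range(n)] for i in range(n)]
    # Video-to-text: each video should pick out its own description.
    v2t = -sum(log_softmax_at(sim[i], i) for i in range(n)) / n
    # Text-to-video: each description should pick out its own video.
    t2v = -sum(log_softmax_at([sim[i][j] for i in range(n)], j) for j in range(n)) / n
    return 0.5 * (v2t + t2v)

# Matched pairs should give a much lower loss than mismatched pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

Minimizing such a loss pulls each video feature toward its own description and away from the other descriptions in the batch, which is one way to reduce the distribution discrepancy between modalities before fusion.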

Paper Structure

This paper contains 26 sections, 10 equations, 21 figures, 15 tables, and 1 algorithm.

Figures (21)

  • Figure 1: Comparison of VL tracking pipelines. (a) The typical VL tracking framework aggregates CNN and Transformer features heterogeneously using multi-stage multi-modal fusion. (b) Our COST performs one-stage multi-modal fusion with a contrastive transformer fusion framework in a homogeneous way and predicts the object location by a tracking head.
  • Figure 2: Some representative samples in the proposed VL-SOT500 dataset. We annotate each video sequence with bounding boxes and a language description. Small objects pose significant challenges to tracking due to less effective visual information, high-speed motion, etc. Best viewed by zooming in.
  • Figure 3: Comparison of (a) generic object tracking and (b) high-speed small object tracking. The latter poses considerably greater challenges, mainly due to the object exhibiting a reduced visual scale and increased relative speeds.
  • Figure 4: Distribution of each attribute in VL-SOT500.
  • Figure 5: Target size and average relative speed distributions in VL-SOT500. Best viewed in color and zoomed in.
  • ...and 16 more figures
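
The one-stage pipeline described in Figure 1(b) can be pictured roughly as follows: a learnable [OBJ] token, the language tokens, and the visual tokens are concatenated into one sequence and passed through a stack of transformer encoder layers, after which a tracking head reads the result. The sketch below is a toy version under strong assumptions: the split of visual tokens into `template_tokens` and `search_tokens` follows a common tracker convention and is not stated here, the attention uses identity Q/K/V projections, and the function names are hypothetical.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def self_attention(tokens):
    """Single-head self-attention with identity projections and a residual.

    A real encoder layer would add learned projections, layer norm, and an
    MLP; they are omitted to keep the sketch short.
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [dot / math.sqrt(d)
                  for dot in (sum(a * b for a, b in zip(q, k)) for k in tokens)]
        w = softmax(scores)
        mixed = [sum(w[j] * tokens[j][c] for j in range(len(tokens)))
                 for c in range(d)]
        out.append([x + y for x, y in zip(q, mixed)])
    return out

def fuse(obj_token, text_tokens, template_tokens, search_tokens, num_layers=2):
    # One-stage fusion: all modalities share a single token sequence, so
    # every layer mixes language and vision jointly.
    seq = [obj_token] + text_tokens + template_tokens + search_tokens
    for _ in range(num_layers):
        seq = self_attention(seq)
    # Return the fused [OBJ] token and the fused search-region tokens, which
    # a tracking head (not shown) could use to predict the object location.
    return seq[0], seq[len(seq) - len(search_tokens):]

obj, search = fuse([0.0, 0.0], [[1.0, 0.0]], [[0.0, 1.0]],
                   [[1.0, 1.0], [0.5, 0.5]])
print(len(obj), len(search))
```

The point of the sketch is the contrast with multi-stage designs: there is no separate fusion module per stage, only a shared sequence processed by identical layers.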