COST: Contrastive One-Stage Transformer for Vision-Language Small Object Tracking

Chunhui Zhang; Li Liu; Jialin Gao; Xin Sun; Hao Wen; Xi Zhou; Shiming Ge; Yanfeng Wang

TL;DR

This work addresses vision-language (VL) tracking of small objects by proposing COST, a contrastive one-stage transformer that learns unified VL representations. COST combines a contrastive alignment step, which maximizes mutual information between a video and its language description, with a visual-linguistic transformer for cross-modal reasoning. The work also introduces VL-SOT500, the first large-scale multi-modal small-object tracking dataset with bounding boxes and language descriptions; it comprises VL-SOT230 for generic small-object tracking and VL-SOT270 for high-speed scenarios. COST achieves state-of-the-art performance on five existing VL tracking benchmarks and on VL-SOT500. Ablations confirm the value of the linguistic branch, a learnable [OBJ] token, and the CoA module, and visualizations illustrate effective cross-modal alignment. Overall, the work shows that a simple, unified transformer-based fusion can robustly handle small-object VL tracking, and it provides a new dataset to accelerate research in open-vocabulary, multimodal tracking.

Abstract

Transformers have recently demonstrated great potential for improving vision-language (VL) tracking algorithms. However, most existing VL trackers rely on carefully designed mechanisms to perform multi-stage multi-modal fusion. Additionally, direct multi-modal fusion without alignment ignores the distribution discrepancy between modalities in feature space, potentially leading to suboptimal representations. In this work, we propose COST, a contrastive one-stage transformer fusion framework for VL tracking that aims to learn semantically consistent and unified VL representations. Specifically, we introduce a contrastive alignment strategy that maximizes mutual information (MI) between a video and its corresponding language description. This enables effective cross-modal alignment, yielding semantically consistent features in the representation space. By leveraging a visual-linguistic transformer, we establish an efficient multi-modal fusion and reasoning mechanism, and we demonstrate empirically that a simple stack of transformer encoders is sufficient to produce unified VL representations. Moreover, we contribute a newly collected VL tracking benchmark for small object tracking, named VL-SOT500, with bounding boxes and language descriptions. The dataset comprises two challenging subsets, VL-SOT230 and VL-SOT270, dedicated to evaluating generic and high-speed small object tracking, respectively. Small object tracking is notoriously challenging because of weak appearance and limited features, and this dataset is, to the best of our knowledge, the first to explore the use of language cues to enhance visual representation for small object tracking. Extensive experiments demonstrate that COST achieves state-of-the-art performance on five existing VL tracking datasets, as well as on our proposed VL-SOT500 dataset. Source code and the dataset will be made publicly available.
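
The abstract does not state the exact form of the contrastive alignment objective. One common way to maximize a lower bound on MI between paired samples is the InfoNCE loss, so the sketch below assumes that formulation purely for illustration; it is not the authors' implementation. The function names, the temperature value, and the toy embeddings are all hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (assumes the vector is non-zero)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def mean_diagonal_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for a stable log-sum-exp
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(video_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    video_embs[i] and text_embs[i] form a positive pair; every other
    combination in the batch is treated as a negative. The temperature
    is an assumed hyperparameter, not a value taken from the paper.
    """
    v = [l2_normalize(e) for e in video_embs]
    t = [l2_normalize(e) for e in text_embs]
    # Cosine similarity between every video and every description.
    logits = [[dot(vi, tj) / temperature for tj in t] for vi in v]
    transposed = [list(col) for col in zip(*logits)]
    video_to_text = mean_diagonal_cross_entropy(logits)
    text_to_video = mean_diagonal_cross_entropy(transposed)
    return 0.5 * (video_to_text + text_to_video)


# Toy batch of two video/description pairs (hypothetical embeddings).
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
swapped_texts = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(videos, aligned_texts)
loss_swapped = info_nce(videos, swapped_texts)
```

Under this formulation, log(N) minus the loss is a lower bound on the MI between the two modalities for a batch of N pairs, so minimizing the loss pushes the bound up. In the toy batch, the aligned pairing yields a much smaller loss than the swapped pairing.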
