Learning Tracking Representations from Single Point Annotations
Qiangqiang Wu, Antoni B. Chan
TL;DR
This paper presents SoCL, a soft contrastive learning framework that learns tracking representations from single point annotations to greatly reduce annotation cost while maintaining or exceeding fully supervised performance. It introduces a Target Objectness Prior (TOP) map to infer target extent and generates Global Soft Templates (GST), Soft Negative Samples (SNS), and Local Soft Templates (LST) to drive end-to-end contrastive learning. The learned representations are applicable to Siamese and correlation-filter trackers, with extensions to pseudo bounding boxes for scale regression trackers like TransT. Experiments show SoCL achieves competitive or superior results at substantially lower annotation cost and demonstrates robustness to annotation noise, offering practical benefits for large-scale tracking deployment.
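To make the idea of soft templates and soft negatives concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a Gaussian centred on the annotated point as a stand-in for the TOP map, pools features with that map (and its complement) to form one soft template and one soft negative, and scores them with a standard InfoNCE-style loss. All function names, the Gaussian prior, the complement-based negative, and the temperature value are illustrative assumptions not taken from the paper.

```python
import math

def gaussian_prior(height, width, point, sigma):
    """Hypothetical objectness prior: a Gaussian centred on the annotated point."""
    py, px = point
    return [[math.exp(-((y - py) ** 2 + (x - px) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]

def soft_pool(features, weights):
    """Weighted average of per-location feature vectors (features[y][x] is a list)."""
    dim = len(features[0][0])
    total = sum(w for row in weights for w in row)
    pooled = [0.0] * dim
    for feat_row, w_row in zip(features, weights):
        for vec, w in zip(feat_row, w_row):
            for d in range(dim):
                pooled[d] += w * vec[d]
    return [p / total for p in pooled]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-12)

def info_nce(query, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: -log softmax of the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, n) / temperature for n in negatives]
    m = max(logits)
    log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denom - logits[0]

# Toy 5x5 feature map: the central 3x3 "target" has feature [1, 0],
# the background has feature [0, 1].
features = [[[1.0, 0.0] if 1 <= y <= 3 and 1 <= x <= 3 else [0.0, 1.0]
             for x in range(5)] for y in range(5)]

prior = gaussian_prior(5, 5, point=(2, 2), sigma=1.0)
complement = [[1.0 - w for w in row] for row in prior]

soft_template = soft_pool(features, prior)       # target-weighted feature
soft_negative = soft_pool(features, complement)  # background-weighted feature

query = [1.0, 0.0]  # assumed target feature from another frame
loss_good = info_nce(query, soft_template, [soft_negative])
loss_bad = info_nce(query, soft_negative, [soft_template])
print(loss_good < loss_bad)
```

In this toy setting the loss is much lower when the target-weighted template is treated as the positive, which is the behaviour a contrastive objective of this kind is meant to reward. The sketch uses one template and one negative per frame; the paper's GST, SNS, and LST generation is more elaborate and is not reproduced here.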
Abstract
Existing deep trackers are typically trained on large-scale video frames with annotated bounding boxes. However, these bounding boxes are expensive and time-consuming to annotate, particularly for large-scale datasets. In this paper, we propose to learn tracking representations from single point annotations (which are 4.5x faster to annotate than traditional bounding boxes) in a weakly supervised manner. Specifically, we propose a soft contrastive learning (SoCL) framework that incorporates a target objectness prior into end-to-end contrastive learning. SoCL consists of adaptive positive and negative sample generation, which is memory-efficient and effective for learning tracking representations. We apply the representation learned by SoCL to visual tracking and show that our method can 1) achieve better performance than the fully supervised baseline trained with box annotations under the same annotation time cost; 2) achieve performance comparable to the fully supervised baseline with the same number of training frames while reducing annotation time cost by 78% and total fees by 85%; and 3) remain robust to annotation noise.
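The relationship between the 4.5x annotation speedup and the 78% time reduction can be checked with a short calculation. This is a sketch under the assumption that the time saving follows directly from the per-frame speedup; the function names are illustrative, and the 85% fee reduction depends on pricing details not given here, so it is not derived.

```python
def time_reduction(speedup):
    """Fraction of annotation time saved when each frame is `speedup` times faster."""
    return 1.0 - 1.0 / speedup

def frames_for_same_budget(box_frames, speedup):
    """Point-annotated frames affordable in the time needed for `box_frames` boxes."""
    return box_frames * speedup

reduction = time_reduction(4.5)
print(f"{reduction:.1%}")  # 77.8%, i.e. roughly 78%

# Under an equal time budget, 1000 box-annotated frames correspond to
# 4500 point-annotated frames.
print(frames_for_same_budget(1000, 4.5))
```

The two functions correspond to the two comparison settings in the abstract: the same number of frames at lower cost, and the same cost with more frames.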
