Learning Tracking Representations from Single Point Annotations

Qiangqiang Wu; Antoni B. Chan

TL;DR

This paper presents SoCL, a soft contrastive learning framework that learns tracking representations from single point annotations to greatly reduce annotation cost while maintaining or exceeding fully supervised performance. It introduces a Target Objectness Prior (TOP) map to infer target extent and generates Global Soft Templates (GST), Soft Negative Samples (SNS), and Local Soft Templates (LST) to drive end-to-end contrastive learning. The learned representations are applicable to Siamese and correlation-filter trackers, with extensions to pseudo bounding boxes for scale regression trackers like TransT. Experiments show SoCL achieves competitive or superior results at substantially lower annotation cost and demonstrates robustness to annotation noise, offering practical benefits for large-scale tracking deployment.

Abstract

Existing deep trackers are typically trained on large-scale video frames with annotated bounding boxes. However, these bounding boxes are expensive and time-consuming to annotate, particularly for large-scale datasets. In this paper, we propose to learn tracking representations from single point annotations (which are 4.5x faster to annotate than traditional bounding boxes) in a weakly supervised manner. Specifically, we propose a soft contrastive learning (SoCL) framework that incorporates a target objectness prior into end-to-end contrastive learning. Our SoCL consists of adaptive positive and negative sample generation, which is memory-efficient and effective for learning tracking representations. We apply the learned representation of SoCL to visual tracking and show that our method can 1) achieve better performance than the fully supervised baseline trained with box annotations under the same annotation time cost; 2) achieve performance comparable to the fully supervised baseline using the same number of training frames while reducing annotation time cost by 78% and total fees by 85%; and 3) be robust to annotation noise.
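The 78% time saving quoted above appears consistent with the 4.5x annotation speedup: if each point takes 1/4.5 of the time of a box, annotating the same number of frames saves roughly 1 - 1/4.5 of the total time. The short sketch below checks that arithmetic. The function name `time_reduction` is hypothetical, and the 85% fee reduction is not reproduced because it depends on pricing details not given in this summary.

```python
def time_reduction(speedup):
    """Fraction of annotation time saved when each label is `speedup` times faster.

    Assumes the same number of frames is annotated in both settings.
    """
    return 1.0 - 1.0 / speedup


# Point annotations are reported as 4.5x faster than bounding boxes.
saving = time_reduction(4.5)
print(f"time saved with point annotations: {saving:.1%}")
```

Read the other way, the same ratio suggests that a fixed annotation time budget covers about 4.5 times as many point-annotated frames as box-annotated ones, which is the setting of the first claim.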

Paper Structure (26 sections, 10 equations, 11 figures, 10 tables)

Introduction
Related Work
Proposed Method
Target Objectness Prior (TOP) Map
Soft Contrastive Learning
Global Soft Template (GST) Generation
Soft Negative Sample (SNS) Generation
Local Soft Template (LST) Generation
Soft Contrastive Learning Loss
Tracking Applications
Experiments
Implementation Details
Ablation Study
Comparison with Same Annotation Time Cost
Improving correlation-filter trackers
...and 11 more sections

Figures (11)

Figure 1: An illustration of video frame annotations using masks, bounding boxes and center points. Humans label point annotations $4.5\times$ and $34.4\times$ faster than bounding box and mask annotations, respectively. In this paper, we propose a novel soft contrastive learning framework to learn tracking representations from point annotations in video frames so as to reduce annotation cost and total fees.
Figure 2: Overview of (a) global soft template generation (GST); and (b) soft negative sample (SNS) generation in the proposed SoCL framework. (a) Given two randomly selected frames ${\cal I}_{i}$ and ${\cal I}_{j}$ in a video, we first extract their context features $f({\cal I}_{i})$ and $f({\cal I}_{j})$, and then calculate GSTs $\mathbf{z}_i$ and $\mathbf{z}_j$ as the weighted sum over the spatial locations on $f({\cal I}_{i})$ and $f({\cal I}_{j})$, where each location weight is taken from the corresponding location in the target objectness prior (TOP) maps $\mathbf{h}_i$ and $\mathbf{h}_j$. (b) During mini-batch training, for a specific GST (e.g., $\mathbf{z}_{i}$), we obtain two similarity maps between $\mathbf{z}_{i}$ and each location in the context features by using a cross-correlation operation (denoted as $\circledast$). We next use a background selection function $s_{b}(\cdot)$ to mask out target responses and select background counterparts with high responses in the similarity maps to generate the SNSs $\hat{\mathbf{z}}_i$ and $\hat{\mathbf{z}}_j$. The generation of both GST and SNS is memory-efficient. $\otimes$ is element-wise multiplication, while $\oplus$ is a sum over spatial locations.
Figure 3: Target objectness prior (TOP) map generation for a given input image, which consists of proposal generation (including both EdgeBox and random proposal generation) and aggregation of objectness measurements.
Figure 4: A plot of target recall at various numbers of kept proposals after NMS. The evaluation uses an overlap threshold of 0.5.
Figure 5: Examples of target objectness prior (TOP) maps generated by using the combination of random and EdgeBox proposal generation.
...and 6 more figures
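The Figure 2 caption describes a GST as a TOP-weighted sum of context features, and an SNS as built from background locations that respond strongly to that GST. A minimal pure-Python sketch of both steps follows; it is an illustration under stated assumptions, not the authors' implementation. The function names, the threshold `target_thresh`, the top-`k` selection and the averaging are hypothetical stand-ins for the background selection function $s_{b}(\cdot)$, whose exact form is not given in the caption. The similarity map uses a per-location dot product, which is what cross-correlation reduces to for a single-vector template, and the weights are left unnormalized because the caption does not say otherwise.

```python
def global_soft_template(features, top_map):
    """Weighted sum of per-location feature vectors, weighted by the TOP map.

    features: H x W grid of C-dimensional vectors (nested lists).
    top_map:  H x W grid of non-negative objectness weights.
    """
    channels = len(features[0][0])
    template = [0.0] * channels
    for feat_row, weight_row in zip(features, top_map):
        for vec, weight in zip(feat_row, weight_row):
            for ch in range(channels):
                template[ch] += weight * vec[ch]
    return template


def similarity_map(template, features):
    """Dot product between the template and the feature vector at each location."""
    return [[sum(t * v for t, v in zip(template, vec)) for vec in row]
            for row in features]


def soft_negative_sample(template, features, top_map, target_thresh=0.5, k=2):
    """Average the k background features that respond most strongly to the template.

    Assumption: locations with TOP weight >= target_thresh count as target and
    are masked out; this stands in for the unspecified s_b.
    """
    sim = similarity_map(template, features)
    background = [(sim[r][c], r, c)
                  for r in range(len(features))
                  for c in range(len(features[0]))
                  if top_map[r][c] < target_thresh]
    hardest = sorted(background, reverse=True)[:k]
    negative = [0.0] * len(template)
    for _, r, c in hardest:
        for ch in range(len(template)):
            negative[ch] += features[r][c][ch] / len(hardest)
    return negative


# Toy 2 x 2 feature map with 2 channels; the target sits at the top-left cell.
features = [[[1.0, 0.0], [0.0, 1.0]],
            [[1.0, 1.0], [0.5, 0.0]]]
top_map = [[1.0, 0.0],
           [0.0, 0.0]]

z = global_soft_template(features, top_map)
z_hat = soft_negative_sample(z, features, top_map)
print(z, z_hat)
```

Both functions reuse a single feature map per frame rather than cropping and re-encoding separate samples, which is one plausible reading of the caption's remark that GST and SNS generation is memory-efficient.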
