OmniTracker: Unifying Object Tracking by Tracking-with-Detection

Junke Wang, Zuxuan Wu, Dongdong Chen, Chong Luo, Xiyang Dai, Lu Yuan, Yu-Gang Jiang

TL;DR

A novel tracking-with-detection paradigm is proposed, in which tracking supplies appearance priors to detection and detection supplies candidate bounding boxes to tracking for association. Built on this paradigm, a unified tracking model, OmniTracker, handles all tracking tasks with a fully shared network architecture, model weights, and inference pipeline.

Abstract

Visual Object Tracking (VOT) aims to estimate the positions of target objects in a video sequence and is an important vision task with various real-world applications. Depending on whether the target objects are specified by annotations in the first frame or by their categories, VOT can be divided into instance tracking (e.g., SOT and VOS) and category tracking (e.g., MOT, MOTS, and VIS) tasks. These different definitions have led to divergent solutions for the two types of tasks, resulting in redundant training expenses and parameter overhead. In this paper, combining the best practices developed in both communities, we propose a novel tracking-with-detection paradigm, in which tracking supplies appearance priors to detection and detection supplies candidate bounding boxes to tracking for association. Equipped with such a design, a unified tracking model, OmniTracker, is further presented to resolve all the tracking tasks with a fully shared network architecture, model weights, and inference pipeline, eliminating the need for task-specific architectures and reducing redundancy in model parameters. We conduct extensive experiments on seven prominent tracking datasets covering different tracking tasks, including LaSOT, TrackingNet, DAVIS16-17, MOT17, MOTS20, and YTVIS19, and demonstrate that OmniTracker achieves on-par or even better results than both task-specific and unified tracking models.

Paper Structure (33 sections, 8 equations, 11 figures, 11 tables)

This paper contains 33 sections, 8 equations, 11 figures, 11 tables.

Figures (11)

  • Figure 1: Comparisons between different tracking paradigms. In tracking-as-detection, the tracker delineates a search region or matches with the memory for the detector, while in tracking-by-detection, the detector predicts the bounding boxes for the tracker to associate. We combine the advantages of both and propose a novel tracking-with-detection paradigm, where a Reference-guided Feature Enhancement (RFE) module supplies the detector with appearance priors, and the tracker then associates all the detected boxes with the existing trajectories according to their spatial and appearance correlation.
  • Figure 2: Reflection on previous tracking paradigms: tracking-as-detection fails when the target object moves rapidly and the estimated search region is incorrect, while tracking-by-detection fails when the target objects cannot be detected.
  • Figure 3: Overview of the proposed OmniTracker, which consists of a backbone network to extract the multi-scale frame features, a Reference-guided Feature Enhancement (RFE) module to model the correlation between the target objects and the tracking frame, and a deformable DETR-based detector to predict the bounding boxes and instance masks. Note that we share the network architecture and inference pipeline for all the tracking tasks. IT: instance tracking, CT: category tracking.
  • Figure 4: Architecture of the proposed RFE module.
  • Figure 5: Illustration of different heads in OmniTracker. All components marked with solid lines are shared for different tasks, while components marked with dotted lines are task-specific. We input a task indicator to the model to indicate which classifier to use during both training and inference.
  • ...and 6 more figures
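
The association step described in the Figure 1 caption (matching detected boxes to existing trajectories by spatial and appearance correlation) can be illustrated with a small sketch. This is not the authors' implementation: the equal weighting of IoU and cosine similarity, the `min_score` threshold, the greedy matching, and all function names are assumptions chosen for illustration, since this summary does not specify how OmniTracker computes or combines the two cues.

```python
import math


def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def cosine(u, v):
    """Cosine similarity of two appearance embeddings (plain sequences)."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def associate(tracks, detections, w_spatial=0.5, min_score=0.3):
    """Greedily match trajectories to detections.

    tracks and detections are lists of (box, embedding) pairs.
    w_spatial and min_score are hypothetical hyperparameters.
    Returns ({track_index: detection_index}, [unmatched detection indices]).
    """
    scored = []
    for t, (t_box, t_emb) in enumerate(tracks):
        for d, (d_box, d_emb) in enumerate(detections):
            # Combine the spatial cue (IoU) with the appearance cue (cosine).
            score = (w_spatial * iou(t_box, d_box)
                     + (1.0 - w_spatial) * cosine(t_emb, d_emb))
            if score >= min_score:
                scored.append((score, t, d))

    # Highest-scoring pairs first; each track and detection is used once.
    scored.sort(reverse=True)
    matches, used_tracks, used_dets = {}, set(), set()
    for score, t, d in scored:
        if t in used_tracks or d in used_dets:
            continue
        matches[t] = d
        used_tracks.add(t)
        used_dets.add(d)

    unmatched = [d for d in range(len(detections)) if d not in used_dets]
    return matches, unmatched


# Two existing trajectories and three detections in the next frame.
tracks = [((0, 0, 10, 10), (1.0, 0.0)), ((20, 20, 30, 30), (0.0, 1.0))]
detections = [
    ((21, 21, 31, 31), (0.0, 1.0)),
    ((1, 1, 11, 11), (1.0, 0.0)),
    ((50, 50, 60, 60), (1.0, 1.0)),
]
matches, unmatched = associate(tracks, detections)
print(matches, unmatched)
```

In this toy example the third detection overlaps no trajectory and only partly resembles either one, so it stays unmatched and would typically start a new trajectory in a category tracking setting. The greedy loop keeps the sketch dependency-free; an optimal assignment solver such as SciPy's `linear_sum_assignment` could be substituted.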