Siamese-DETR for Generic Multi-Object Tracking

Qiankun Liu, Yichen Li, Yuqi Jiang, Ying Fu

TL;DR

This paper focuses on TIMOT and proposes Siamese-DETR, a simple but effective method that leverages the inherent object queries in DETR variants and surpasses existing MOT methods on the GMOT-40 dataset by a large margin.

Abstract

The ability to detect and track dynamic objects in different scenes is fundamental to real-world applications, e.g., autonomous driving and robot navigation. However, traditional Multi-Object Tracking (MOT) is limited to tracking objects that belong to pre-defined closed-set categories. Recently, Open-Vocabulary MOT (OVMOT) and Generic MOT (GMOT) have been proposed to track objects of interest beyond pre-defined categories, given a text prompt and a template image, respectively. However, training OVMOT models requires expensive, well pre-trained (vision-)language models and fine-grained category annotations. In this paper, we focus on GMOT and propose a simple but effective method, Siamese-DETR. Only commonly used detection datasets (e.g., COCO) are required for training. Unlike existing GMOT methods, which train a Single Object Tracking (SOT) based detector to detect objects of interest and then apply a data-association-based MOT tracker to obtain the trajectories, we leverage the inherent object queries in DETR variants. Specifically: 1) Multi-scale object queries are designed based on the given template image, which are effective for detecting objects of different scales that share the category of the template image; 2) A dynamic matching training strategy is introduced to train Siamese-DETR on commonly used detection datasets, taking full advantage of the provided annotations; 3) The online tracking pipeline is simplified in a tracking-by-query manner by incorporating the tracked boxes from the previous frame as additional query boxes. The complex data association is replaced with the much simpler Non-Maximum Suppression (NMS). Extensive experimental results show that Siamese-DETR surpasses existing MOT methods on the GMOT-40 dataset by a large margin. Code is available at https://github.com/yumu-173/Siamese-DETR.
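To make point 3 of the abstract more concrete, the following is a minimal sketch of one tracking-by-query step in which NMS takes the place of data association. It is not the authors' implementation. The function names (`iou`, `track_by_query_step`), the thresholds, and the rule that predictions from tracked boxes win over predictions from learned query boxes are all assumptions made for illustration.

```python
from itertools import count


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def track_by_query_step(predictions, next_id, score_thr=0.5, iou_thr=0.5):
    """One frame of a hypothetical tracking-by-query update.

    predictions: list of (box, score, track_id). track_id is the identity
    carried by a tracked box from the previous frame, or None when the
    prediction comes from a learned query box.
    next_id: iterator that yields fresh identities for new objects.
    """
    # Drop low-confidence predictions (threshold is an assumption).
    kept = [p for p in predictions if p[1] >= score_thr]
    # Assumption: tracked predictions are visited first, then by score.
    kept.sort(key=lambda p: (p[2] is not None, p[1]), reverse=True)

    tracks = []
    for box, score, track_id in kept:
        # Greedy NMS: skip boxes that overlap an already accepted box.
        if any(iou(box, t[0]) > iou_thr for t in tracks):
            continue
        if track_id is None:
            track_id = next(next_id)  # newly appeared or re-detected object
        tracks.append((box, score, track_id))
    return tracks


predictions = [
    ((10, 10, 50, 50), 0.90, 1),         # from a tracked box (identity 1)
    ((12, 11, 51, 49), 0.95, None),      # learned query, duplicate of identity 1
    ((100, 100, 140, 150), 0.80, None),  # learned query, new object
    ((200, 200, 220, 220), 0.20, 2),     # tracked box whose object vanished
]
tracks = track_by_query_step(predictions, count(start=3))
print([t[2] for t in tracks])  # [1, 3]
```

In this sketch the duplicate detection is suppressed by the overlapping tracked prediction, so identity 1 is propagated without any explicit matching step, and the new object simply receives the next free identity.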

Paper Structure (34 sections, 7 equations, 7 figures, 10 tables)

This paper contains 34 sections, 7 equations, 7 figures, 10 tables.

Figures (7)

  • Figure 1: The online tracking pipeline of Siamese-DETR for generic multi-object tracking based on a template image. The template image is fed into the backbone network to obtain the query contents, while the query boxes consist of the learned query boxes and the tracked boxes from the previous frame. With this design, objects that are already tracked are followed in the current frame by their corresponding tracked boxes, while objects that were missed in the previous frame (but still exist in the current frame) or that newly appear in the current frame are detected and tracked by the learned query boxes.
  • Figure 2: Overview of Siamese-DETR in the training stage. The multi-scale object queries are decoupled into learnable query boxes and query contents. The query contents are mapped from the multi-scale features that the backbone network extracts from the template image. The model is trained with the Hungarian loss carion2020end and the proposed dynamic matching training strategy, which dynamically turns the provided annotations into positive and negative samples according to the given template image. For simplicity, the optimized query denoising is not shown in the figure.
  • Figure 3: Illustration of query denoising. (a) Input image and template image. (b) Original query denoising li2022dn, which causes conflicts for TIMOT. The noisy object queries are classified according to the labeled category IDs associated with the query boxes, without taking the noisy query contents into consideration. (c) Optimized query denoising. The noisy object queries are classified according to the matching results between the query contents and the noisy query boxes. The numbers 1 and 0 denote that the model tries to classify the object queries as positive and negative samples, respectively, while the markers ✗ and $\checkmark$ indicate whether the classification is wrong or right.
  • Figure 4: Qualitative comparison of different methods. Two observations can be made: 1) when combined with the same tracker (e.g., TbQ), our Siamese-DETR tracks more objects than GLIP-T (B) li2022grounded; 2) with the same detector, our TbQ pipeline tracks more objects than SORT bewley2016simple. The evaluated Siamese-DETR is trained on COCO lin2014microsoft with Swin-T liu2021swin as the backbone network.
  • Figure 5: The template images used for each category. We present four template images per category, since there are four videos per category. All template images are padded to a square resolution.
  • ...and 2 more figures
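
The captions of Figure 2 and Figure 3(c) describe labels that depend on the template rather than on fixed category IDs. The following is a rough sketch of that idea, assuming COCO-style annotations reduced to (box, category) pairs. The helper names `dynamic_targets` and `noisy_query`, the noise scale, and the way a template is sampled are hypothetical and are not taken from the paper or its code.

```python
import random


def dynamic_targets(annotations, template_category):
    """Label each annotated box relative to the current template.

    annotations: list of (box, category) pairs for one training image.
    Boxes of the template's category become positives (1); all other
    annotated boxes become negatives (0).
    """
    return [(box, 1 if category == template_category else 0)
            for box, category in annotations]


def noisy_query(box, rng, scale=0.1):
    """Jitter an (x1, y1, x2, y2) box to build a denoising query.

    Each coordinate moves by at most `scale` times the box width
    (x coordinates) or height (y coordinates). The noise model is an
    assumption for illustration.
    """
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    return tuple(v + rng.uniform(-scale, scale) * s
                 for v, s in zip(box, (w, h, w, h)))


annotations = [
    ((10, 10, 50, 50), "person"),
    ((60, 20, 90, 80), "dog"),
    ((5, 40, 30, 70), "person"),
]

rng = random.Random(0)
# Assumption: the template is cropped from one randomly chosen annotation,
# so the same image yields different labels for different templates.
_, template_category = rng.choice(annotations)
targets = dynamic_targets(annotations, template_category)

# In the spirit of Figure 3(c): a noisy query keeps the label decided by
# the template match, not a label tied to the box's original category ID.
denoising_queries = [(noisy_query(box, rng), label) for box, label in targets]
```

Because the labels are recomputed for every sampled template, every annotation in a detection dataset can serve as either a positive or a negative sample across training iterations.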