Transformer-based assignment decision network for multiple object tracking

Athena Psalta, Vasileios Tsironis, Konstantinos Karantzalos

TL;DR

This paper introduces the Transformer-based Assignment Decision Network (TADN) for data association in online tracking-by-detection MOT. TADN infers detection-to-target assignments in a single forward pass: it produces an Assignment Score Matrix $ASM$ over $N$ detections and $M+1$ targets (the $M$ active targets plus a $null$ target) and computes $A_{final}=\text{argmax}(ASM)$ row-wise, so no explicit optimization step is needed at inference and the tracker can be trained end-to-end within a simple tracking framework. The authors present two TADN architectures (single and dual-branch), a training strategy that uses a Label Assignment Matrix (LAM) and a progressive predictor/teacher mix, and experiments on MOT17, MOT20, and UA-DETRAC showing competitive MOT metrics and real-time association speeds (~10 Hz). Although the baseline tracker is simple and lacks re-identification and occlusion handling, the results indicate TADN's potential as a lightweight, transferable data association module. The work points to integrating TADN into more sophisticated MOT systems and real-time applications, including embedded platforms.
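The row-wise argmax over the Assignment Score Matrix can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `assign_detections`, the score values, and the convention that the last column holds the $null$ target are all assumptions made for illustration.

```python
from typing import List, Optional


def assign_detections(asm: List[List[float]]) -> List[Optional[int]]:
    """Row-wise argmax over an N x (M+1) score matrix.

    Assumption (not from the paper): the LAST column of each row is the
    null-target score. Returns, for each detection, the index of the
    selected target, or None when the null target scores highest.
    """
    assignments: List[Optional[int]] = []
    for row in asm:
        if not row:
            raise ValueError("each row needs at least the null-target column")
        # Index of the highest score in this row (first index wins on ties).
        best = max(range(len(row)), key=row.__getitem__)
        null_index = len(row) - 1
        assignments.append(None if best == null_index else best)
    return assignments


# Hypothetical scores for N = 3 detections and M = 2 active targets.
asm = [
    [0.8, 0.1, 0.1],  # detection 0 -> target 0
    [0.2, 0.3, 0.5],  # detection 1 -> null target
    [0.1, 0.7, 0.2],  # detection 2 -> target 1
]
print(assign_detections(asm))  # [0, None, 1]
```

A plain row-wise argmax does not by itself prevent two detections from selecting the same target; how TADN deals with that case is not described in this summary, so the sketch leaves it open.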

Abstract

Data association is a crucial component of any multiple object tracking (MOT) method that follows the tracking-by-detection paradigm. To generate complete trajectories, such methods employ a data association process to establish assignments between detections and existing targets at each timestep. Recent data association approaches either solve a multi-dimensional linear assignment task or a network flow minimization problem, or tackle the problem via multiple hypothesis tracking. However, during inference an optimization step that computes optimal assignments is required for every frame of a sequence, adding complexity to any given solution. To this end, we introduce the Transformer-based Assignment Decision Network (TADN), which tackles data association without the need for any explicit optimization during inference. In particular, TADN can directly infer assignment pairs between detections and active targets in a single forward pass of the network. We have integrated TADN into a rather simple MOT framework, designed a novel training strategy for efficient end-to-end training, and demonstrated the high potential of our approach for online visual tracking-by-detection MOT on several popular benchmarks, i.e., MOT17, MOT20 and UA-DETRAC. Our proposed approach demonstrates strong performance in most evaluation metrics despite its simple nature as a tracker lacking significant auxiliary components such as occlusion handling or re-identification. The implementation of our method is publicly available at https://github.com/psaltaath/tadn-mot.

Paper Structure

This paper contains 13 sections, 13 equations, 7 figures, 6 tables, and 3 algorithms.

Figures (7)

  • Figure 1: Single and dual branch configurations for TADN. Left: The single branch approach consists of a single Transformer model. The Target and Detection input streams are fed to the decoder and encoder parts respectively. The Target output stream is the Transformer's standard output, while the Detection output stream is the output of the Transformer's encoder. Right: The dual branch version uses two separate Transformer models. Each output stream corresponds to the output of one Transformer model. The Detection and Target input streams are fed to the decoder and encoder parts respectively for the Detection branch, and vice versa for the Target branch. In both architectures, a $null\; target$ embedding is concatenated to the Target input stream before feeding it to the Transformer.
  • Figure 2: Overview of our MOT pipeline. Two input streams containing positional and appearance information are generated for $N$ detections and $M$ currently active targets respectively. These are fed to TADN to compute an $N \times (M+1)$ similarity matrix. Final assignments are directly computed via a row-wise argmax operation. The final assignments include the $null\; target$ case.
  • Figure 3: Success cases from TADN results on the MOT17 test set in the presence of partial occlusion, various scene geometries and illumination conditions. Each pair is $\sim$ 30 frames apart.
  • Figure 4: MOTA on UA-DETRAC test set for detection thresholds along the PR-curve.
  • Figure 5: Failure examples from TADN results on the MOT17 test set in the presence of poor detections and strong occlusions. Cyan lines: id-switches. Red ellipses: Falsely tracked objects.
  • ...and 2 more figures