Strong-TransCenter: Improved Multi-Object Tracking based on Transformers with Dense Representations

Amit Galor; Roy Orfaig; Ben-Zion Bobrovsky

TL;DR

This work targets motion (track displacement) estimation, a bottleneck in Transformer-based multi-object tracking, by augmenting TransCenter with a fine-tuned Kalman filter and an appearance embedding network. The resulting tracker is called Strong-TransCenter (STC). A heatmap-derived noise model informs the Kalman measurement noise, while a FastReID-based embedding supports robust re-identification within a three-stage cascade association. Empirical results on MOT17 and MOT20 show STC achieving higher HOTA and IDF1 than other Transformer-based trackers, with MOTA remaining competitive and a modest runtime impact. The findings suggest that targeted post-processing and re-ID integration can substantially improve tracker robustness, potentially guiding future all-in-one Transformer MOT designs.
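
As a rough illustration of how a heatmap could inform the Kalman measurement noise, the sketch below measures the full width at half maximum (FWHM) of a detection peak along x and y and turns the two widths into a diagonal measurement-noise covariance. The FWHM measurement follows the description in the Figure 2 caption; the Gaussian conversion from FWHM to variance, the function names, and the synthetic heatmap are assumptions made for this example and are not taken from the paper or its code.

```python
import numpy as np


def fwhm_1d(profile: np.ndarray, peak_idx: int) -> float:
    """Number of contiguous samples around peak_idx at or above half the peak value."""
    half = profile[peak_idx] / 2.0
    left = peak_idx
    while left > 0 and profile[left - 1] >= half:
        left -= 1
    right = peak_idx
    while right < len(profile) - 1 and profile[right + 1] >= half:
        right += 1
    return float(right - left + 1)


def measurement_noise_from_heatmap(heatmap: np.ndarray, peak_yx: tuple) -> np.ndarray:
    """Hypothetical mapping from the peak widths (Fx, Fy) to a 2x2 noise covariance R."""
    y, x = peak_yx
    fx = fwhm_1d(heatmap[y, :], x)  # width of the peak along x
    fy = fwhm_1d(heatmap[:, x], y)  # width of the peak along y
    # Assumption: treat the peak as Gaussian, where FWHM = 2*sqrt(2*ln 2) * sigma.
    k = 2.0 * np.sqrt(2.0 * np.log(2.0))
    sigma_x, sigma_y = fx / k, fy / k
    return np.diag([sigma_x**2, sigma_y**2])


# Synthetic heatmap: one peak at (y=20, x=30), wider along x than along y.
ys, xs = np.mgrid[0:41, 0:61]
heatmap = np.exp(-((xs - 30) ** 2 / (2 * 3.0**2) + (ys - 20) ** 2 / (2 * 1.5**2)))
R = measurement_noise_from_heatmap(heatmap, (20, 30))
print(R)
```

With a mapping of this kind, a broad, uncertain peak yields a larger measurement variance, so the filter leans more on its motion prediction for that detection, while a sharp peak pulls the state estimate toward the measurement.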

Abstract

Transformer networks have been a focus of research in many fields in recent years, being able to surpass the state-of-the-art performance in different computer vision tasks. However, in the task of Multiple Object Tracking (MOT), leveraging the power of Transformers remains relatively unexplored. Among the pioneering efforts in this domain, TransCenter, a Transformer-based MOT architecture with dense object queries, demonstrated exceptional tracking capabilities while maintaining reasonable runtime. Nonetheless, one critical aspect in MOT, track displacement estimation, presents room for enhancement to further reduce association errors. In response to this challenge, our paper introduces a novel improvement to TransCenter. We propose a post-processing mechanism grounded in the Track-by-Detection paradigm, aiming to refine the track displacement estimation. Our approach involves the integration of a carefully designed Kalman filter, which incorporates Transformer outputs into measurement error estimation, and the use of an embedding network for target re-identification. This combined strategy yields substantial improvement in the accuracy and robustness of the tracking process. We validate our contributions through comprehensive experiments on the MOTChallenge datasets MOT17 and MOT20, where our proposed approach outperforms other Transformer-based trackers. The code is publicly available at: https://github.com/amitgalor18/STC_Tracker

Paper Structure

This paper contains 14 sections, 18 equations, 6 figures, and 3 tables.

Introduction
Related Work
Proposed Method
Overview
Kalman Filter
Embedding Network
Experiments
Dataset and Evaluation Metrics
Implementation
Results
Offline modules
Ablation Study
Qualitative Analysis
Conclusion

Figures (6)

Figure 1: A flowchart overview of our STC tracker. The main TransCenter architecture [Transcenter_Xu2022] is shown in simplified form on the left. The additional blocks are in dark red (the Kalman filter, described in the Kalman Filter section, and the embedding network, described in the Embedding Network section) and the modified blocks are in purple. The detection and track positions are used to calculate the GIoU distances [GIOU_Rezatofighi_2018_CVPR], while the detection and track embeddings are used to calculate the appearance distance. The cascade matching contains two association steps that match new detections with existing tracks using a combined appearance and GIoU score. The Re-ID module attempts to match remaining detections with inactive tracks. The post-processing block is an optional addition, as in [StrongSORT, ByteTrack_Zhang2021, Aharon2022], and is described in the Offline modules section. The area with the pink background is further detailed in Fig. 3.
Figure 2: Visualization of the FWHM (full width at half maximum) calculation. The TransCenter Transformer predicts a heatmap of the object detections. The blue window on the right is a zoom-in on an area on the left of the frame. The width of each peak (FWHM) is calculated in the x and y directions, and is denoted by Fx and Fy.
Figure 3: The main contributions of this paper: a flowchart of the modified association stage. The new modules are shown on the red background, and correspond to the Kalman, Embedding and 1st Association blocks from Fig. 1. The fine-tuned Kalman filter uses information from the detection heatmap to improve the GIoU distance accuracy, while the appearance embedding distance helps in cases of pedestrians crowded together. The Hungarian algorithm for linear association is then applied to the resulting cost matrix. The process is then repeated for the 2nd Association and Re-ID stages, as seen in Fig. 1.
Figure 4: Visualization of a specific scenario in the results of a tracker based only on a Transformer (TransCenterV2 [Transcenter_Xu2022]) at the top and our STC tracker results (Transformer with Kalman and Embedding) at the bottom. The green trajectories are the tracker predictions with a history of 20 frames and the orange trajectories are the ground truth positions. An identity switch (IDSW) has occurred in the top image, shown by the big "jump" in the trajectory from right to left, which gives the newly emerged pedestrian an existing track ID. In the bottom image the trajectory only begins when the pedestrian emerges. The results are from the MOT17-09 video on frame 389.
Figure 5: Visualization of a specific scenario in the results of our tracker when based on the default Kalman implementation seen in many works [Wojke2017, StrongSORT, ByteTrack_Zhang2021, FairMOT], compared with our tracker after the modifications to the Kalman filter. The green trajectories are the tracker predictions with a history of 20 frames and the orange trajectories are the ground truth positions. The image on the left shows a person reappearing from the right after an occlusion and receiving a new ID from the tracker, while a new person in the back appears in the frame and wrongly receives an existing ID with a long history. In the image on the right, both errors are corrected with the modified Kalman filter. The results are from the MOT17-11 video on frame 720.
...and 1 more figures
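
The association step described in the Figure 1 and Figure 3 captions (a combined appearance and GIoU cost, solved with the Hungarian algorithm) can be sketched roughly as follows. This is a minimal single-stage illustration, not the authors' implementation: the weight `w_app`, the gating threshold `max_cost`, the cosine appearance distance, and the rescaling of GIoU to a distance are all assumptions chosen for the example. The only library calls are NumPy and `scipy.optimize.linear_sum_assignment`.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def giou(a, b) -> float:
    """Generalized IoU of two boxes given as [x1, y1, x2, y2]; value lies in (-1, 1]."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))
    return inter / union - (hull - union) / hull


def cosine_distance(u, v) -> float:
    u, v = np.asarray(u, dtype=float), np.asarray(v, dtype=float)
    return 1.0 - float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


def associate(track_boxes, track_embs, det_boxes, det_embs, w_app=0.5, max_cost=0.7):
    """One association stage; w_app and max_cost are hypothetical tuning values."""
    cost = np.zeros((len(track_boxes), len(det_boxes)))
    for i, (tb, te) in enumerate(zip(track_boxes, track_embs)):
        for j, (db, de) in enumerate(zip(det_boxes, det_embs)):
            giou_dist = (1.0 - giou(tb, db)) / 2.0  # rescaled to [0, 1)
            cost[i, j] = w_app * cosine_distance(te, de) + (1.0 - w_app) * giou_dist
    rows, cols = linear_sum_assignment(cost)  # Hungarian-style optimal assignment
    # Gate out pairs whose combined cost is too high to be a plausible match.
    return [(int(r), int(c)) for r, c in zip(rows, cols) if cost[r, c] <= max_cost]


# Two tracks and two detections listed in swapped order.
track_boxes = [[0, 0, 10, 20], [50, 0, 60, 20]]
track_embs = [[1.0, 0.0], [0.0, 1.0]]
det_boxes = [[51, 0, 61, 20], [1, 0, 11, 20]]
det_embs = [[0.0, 1.0], [1.0, 0.0]]
matches = associate(track_boxes, track_embs, det_boxes, det_embs)
print(matches)
```

In a cascade of the kind the captions describe, the detections left unmatched by one stage would be passed to the next stage (the 2nd Association, then Re-ID against inactive tracks), possibly with different thresholds per stage.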
