SMILEtrack: SiMIlarity LEarning for Occlusion-Aware Multiple Object Tracking

Yu-Hsiang Wang, Jun-Wei Hsieh, Ping-Yang Chen, Ming-Ching Chang, Hung Hin So, Xin Li

TL;DR

SMILEtrack tackles occlusion and appearance similarity in single-camera MOT by decoupling detection and embedding and introducing a Siamese Similarity Learning Module (SLM) with Patch Self-Attention to compute reliable appearance affinity. The Similarity Matching Cascade (SMC), augmented by a novel GATE function, performs robust cross-frame data association by fusing IoU and learned appearance scores, using a two-stage Hungarian matching process. Empirical results on MOT17 and MOT20 show state-of-the-art or competitive performance in MOTA, IDF1, and HOTA, with real-time inference speeds, highlighting a favorable cost-performance trade-off for an SDE-based MOT approach. The work demonstrates that targeted appearance-based similarity learning and gated, multi-template matching can substantially mitigate ID switches under occlusion, with potential impact on surveillance, autonomous driving, and robotics where efficient, accurate MOT is critical.
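
The two-stage association summarized above can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation. The function names, the blend weight `w_iou`, the gate threshold `gate_thresh`, and the confidence split `high_thresh` are all hypothetical. The gate shown here (reject a pair when the appearance similarity is too low or the boxes do not overlap) is an assumed stand-in for the paper's GATE function, and splitting detections into high- and low-confidence groups for the two stages is likewise an assumption. To stay dependency-free, an exhaustive search replaces the Hungarian algorithm; it also returns a maximum-score assignment, but only scales to small inputs.

```python
from itertools import permutations

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def fused_score(iou_val, app_sim, w_iou=0.5, gate_thresh=0.3):
    """Hypothetical gate: drop the pair if either cue is too weak, else blend the cues."""
    if app_sim < gate_thresh or iou_val <= 0.0:
        return None
    return w_iou * iou_val + (1.0 - w_iou) * app_sim

def best_assignment(score):
    """Exhaustive maximum-score matching; score[i][j] is None for gated pairs."""
    n = len(score)
    m = len(score[0]) if n else 0
    if n == 0 or m == 0:
        return []
    transposed = n > m
    if transposed:  # iterate over the smaller side
        score = [[score[i][j] for i in range(n)] for j in range(m)]
        n, m = m, n
    best_total, best_pairs = -1.0, []
    for cols in permutations(range(m), n):
        pairs = [(i, j) for i, j in enumerate(cols) if score[i][j] is not None]
        total = sum(score[i][j] for i, j in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return [(j, i) for i, j in best_pairs] if transposed else best_pairs

def match_cascade(tracks, dets, app_sim, high_thresh=0.6):
    """tracks: list of boxes; dets: list of (box, confidence); app_sim[t][d] in [0, 1]."""
    high = [d for d, (_, conf) in enumerate(dets) if conf >= high_thresh]
    low = [d for d, (_, conf) in enumerate(dets) if conf < high_thresh]
    matches, free_tracks = [], list(range(len(tracks)))
    for stage_dets in (high, low):  # stage 1: confident detections, stage 2: the rest
        score = [[fused_score(iou(tracks[t], dets[d][0]), app_sim[t][d])
                  for d in stage_dets] for t in free_tracks]
        stage = [(free_tracks[i], stage_dets[j]) for i, j in best_assignment(score)]
        matches += stage
        used = {t for t, _ in stage}
        free_tracks = [t for t in free_tracks if t not in used]
    return matches, free_tracks

tracks = [(0, 0, 10, 10), (20, 0, 30, 10)]
dets = [((1, 0, 11, 10), 0.9), ((21, 0, 31, 10), 0.2)]
app_sim = [[0.8, 0.1], [0.1, 0.7]]
print(match_cascade(tracks, dets, app_sim))  # ([(0, 0), (1, 1)], [])
```

In this toy input the second detection has low confidence, so it is only considered in the second stage, where appearance similarity and overlap with the remaining track let it keep that track's identity.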

Abstract

Despite recent progress in Multiple Object Tracking (MOT), obstacles such as occlusions, similar objects, and complex scenes remain open challenges. Meanwhile, a systematic study of the cost-performance trade-off for the popular tracking-by-detection paradigm is still lacking. This paper introduces SMILEtrack, an object tracker that addresses these challenges by integrating an efficient object detector with a Siamese network-based Similarity Learning Module (SLM). The technical contributions of SMILEtrack are twofold. First, we propose an SLM that calculates the appearance similarity between two objects, overcoming the limitations of feature descriptors in Separate Detection and Embedding (SDE) models. The SLM incorporates a Patch Self-Attention (PSA) block inspired by the Vision Transformer, which generates reliable features for accurate similarity matching. Second, we develop a Similarity Matching Cascade (SMC) module with a novel GATE function for robust object matching across consecutive video frames, further enhancing MOT performance. Together, these innovations help SMILEtrack achieve an improved trade-off between cost (e.g., running speed) and performance (e.g., tracking accuracy) over several existing state-of-the-art trackers, including the popular ByteTrack method. SMILEtrack outperforms ByteTrack by 0.4-0.8 MOTA and 2.1-2.2 HOTA points on the MOT17 and MOT20 datasets. Code is available at https://github.com/pingyang1117/SMILEtrack_Official
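
To make the idea of a Siamese similarity score over attended patches concrete, here is a minimal sketch. It is not the SLM or PSA block from the paper. It assumes each object crop has already been turned into a short list of patch feature vectors, uses single-head self-attention with identity query/key/value projections (no learned weights), mean-pools the attended patches, and compares the two embeddings with cosine similarity. All function names are hypothetical. The only property it is meant to show is that both inputs pass through the same `embed` branch, which is what makes the comparison Siamese.

```python
import math

def self_attention(patches):
    """Single-head self-attention over patch vectors, with identity Q/K/V projections."""
    d = len(patches[0])
    attended = []
    for q in patches:
        logits = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in patches]
        peak = max(logits)  # subtract the max for a numerically stable softmax
        weights = [math.exp(l - peak) for l in logits]
        z = sum(weights)
        attended.append([sum(w / z * v[c] for w, v in zip(weights, patches))
                         for c in range(d)])
    return attended

def embed(patches):
    """Shared branch: attend across patches, then mean-pool into one vector."""
    attended = self_attention(patches)
    n = len(attended)
    return [sum(p[c] for p in attended) / n for c in range(len(attended[0]))]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def similarity(patches_a, patches_b):
    """Siamese scoring: both inputs go through the same embed function."""
    return cosine(embed(patches_a), embed(patches_b))

crop_a = [[1.0, 0.0], [0.9, 0.1]]  # toy patch features for one object
crop_b = [[0.0, 1.0], [0.1, 0.9]]  # toy patch features for a different object
print(similarity(crop_a, crop_a) > similarity(crop_a, crop_b))  # True
```

A score like this could feed the appearance term of a matching cost; in a trained model the projections and the patch features would be learned rather than fixed.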

Paper Structure

This paper contains 17 sections, 4 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Comparative analysis of HOTA-MOTA-FPS for different trackers on the MOT17 test set. X-axis: FPS (running speed). Y-axis: HOTA. Circle radius: MOTA score. SMILEtrack registers 80.7 MOTA and 65.0 HOTA at 37.5 FPS, exceeding all other trackers (see the MOT17 state-of-the-art comparison table for details).
  • Figure 2: The architecture of the proposed SMILEtrack. SMILEtrack is a Siamese network-like architecture that learns the appearance features of two objects and calculates their similarity score. It consists of two modules: (i) object detection and (ii) object association.
  • Figure 3: Appearance similarity between low-score detections in the current frame and tracks from the previous frame.
  • Figure 4: Different types of patch layout: configuration (E) achieves the best performance because it can actively attend to PSA-occluded parts when occlusion occurs.
  • Figure 5: The Patch Self-Attention (PSA) architecture.
  • ...and 3 more figures