Hierarchical IoU Tracking based on Interval

Yunhao Du, Zhicheng Zhao, Fei Su

TL;DR

The paper addresses multi-object tracking by removing reliance on heavy appearance models and learning-based association, proposing HIT, a unified hierarchical IoU tracking framework that uses tracklet intervals as priors. HIT merges tracklets across multiple hierarchies using IoU-based association with Kalman motion, and introduces three consistency designs to counter inconsistencies in target size, camera movement, and hierarchical cues. The approach yields competitive results on MOT17, KITTI, DanceTrack, and VisDrone, and demonstrates versatile integration as a post-processing refinement for other trackers. Overall, HIT provides a simple yet effective baseline for offline tracking and post-processing, with potential for further gains through future learning-based enhancements.

Abstract

Multi-Object Tracking (MOT) aims to detect and associate all targets of given classes across frames. Current dominant solutions, e.g. ByteTrack and StrongSORT++, follow a hybrid pipeline, which first accomplishes most of the associations in an online manner and then refines the results using offline tricks such as interpolation and global link. While this paradigm offers flexibility in application, the disjoint design between the two stages results in suboptimal performance. In this paper, we propose the Hierarchical IoU Tracking framework, dubbed HIT, which achieves unified hierarchical tracking by utilizing tracklet intervals as priors. To keep the method concise, only IoU is used for association, discarding heavy appearance models, tricky auxiliary cues, and learning-based association modules. We further identify three inconsistency issues regarding target size, camera movement and hierarchical cues, and design corresponding solutions to guarantee the reliability of associations. Despite its simplicity, our method achieves promising performance on four datasets, i.e., MOT17, KITTI, DanceTrack and VisDrone, providing a strong baseline for future tracking method design. Moreover, we experiment on seven trackers and show that HIT can be seamlessly integrated with other solutions, whether they are motion-based, appearance-based or learning-based. Our code will be released at https://github.com/dyhBUPT/HIT.
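
The hierarchical idea in the abstract (merge tracklets using only IoU, while gradually raising the maximum allowed tracklet interval) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `iou` and `hierarchical_merge`, the threshold value, the greedy matching, and the use of raw boundary boxes instead of Kalman-predicted boxes are all simplifying assumptions made for illustration.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def hierarchical_merge(detections, max_interval=2, iou_thresh=0.3):
    """Merge detections into tracklets, one hierarchy per interval value.

    detections: list of (frame, box). Returns a list of dicts frame -> box.
    Assumptions (hypothetical, for illustration only): greedy matching,
    no motion model, a fixed IoU threshold.
    """
    tracklets = [{frame: box} for frame, box in detections]
    for dt in range(1, max_interval + 1):
        # Score every ordered pair whose temporal gap is at most dt.
        candidates = []
        for i, a in enumerate(tracklets):
            end = max(a)
            for j, b in enumerate(tracklets):
                start = min(b)
                if i != j and 0 < start - end <= dt:
                    score = iou(a[end], b[start])
                    if score >= iou_thresh:
                        candidates.append((score, i, j))
        # Greedy one-to-one linking, highest IoU first.
        candidates.sort(reverse=True)
        successor, has_predecessor = {}, set()
        for score, i, j in candidates:
            if i not in successor and j not in has_predecessor:
                successor[i] = j
                has_predecessor.add(j)
        # Follow each chain of links from its head and merge it.
        merged = []
        for head in range(len(tracklets)):
            if head in has_predecessor:
                continue
            track, k = dict(tracklets[head]), head
            while k in successor:
                k = successor[k]
                track.update(tracklets[k])
            merged.append(track)
        tracklets = merged
    return tracklets


# Toy data: 4 frames, 3 targets, target C is missed in frame 2.
def box(x0, frame):
    return (x0 + 2 * frame, 0, x0 + 2 * frame + 20, 40)

dets = [(f, box(0, f)) for f in range(1, 5)]        # target A
dets += [(f, box(100, f)) for f in range(1, 5)]     # target B
dets += [(f, box(200, f)) for f in (1, 3, 4)]       # target C, frame 2 missing

print(len(hierarchical_merge(dets, max_interval=1)))  # C stays split in two
print(len(hierarchical_merge(dets, max_interval=2)))  # C is recovered
```

In this toy setup the first hierarchy links only adjacent-frame boxes, so the missed detection leaves target C in two pieces; the second hierarchy bridges the one-frame gap. A real system would typically replace the greedy step with optimal assignment and the raw boxes with motion-compensated predictions.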

Paper Structure (27 sections, 3 equations, 3 figures, 7 tables)

Figures (3)

  • Figure 1: The comparison among different offline tracking frameworks. We construct a sequence with eight frames and three targets as an example, where the dashed ones represent missing detections. (a) Current dominant hybrid methods first track targets in an online manner, and then refine trajectories with interpolation and global association. (b) Cluster-based methods first generate reliable tracklets and then model tracklet association as a graph partition problem for clustering. (c) Previous hierarchical solutions iteratively match neighboring tracklets with an increasing window size $W$. (d) Our framework also follows the hierarchical paradigm and gradually increases the maximum tracklet interval $\Delta t$ to ensure the purity of results.
  • Figure 2: The illustration of our framework and three inconsistency issues. Left: We illustrate our pipeline with a simple example with 4 frames and 3 targets. After the first hierarchy ($\Delta t=1$), all adjacent detections are associated. Then the second hierarchy ($\Delta t=2$) further identifies the missed association. Right: ① illustrates the "inconsistent target size" issue, in which smaller boxes tend to have lower IoU for a given localization error. ② shows that camera movement causes box misalignment across frames, which is termed "inconsistent camera movement". ③ clarifies the "inconsistent hierarchical cues" issue, where the first hierarchy can only utilize overlap information of adjacent boxes, while higher hierarchies can incorporate motion information.
  • Figure 3: The illustration of integrating HIT with another tracker. In this example, two trajectories are occluded and switch IDs. In our pipeline, they are first split into four short tracklets based on continuity and then recombined by HIT.
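
The "inconsistent target size" issue from Figure 2 can be checked with a small worked example. The sketch below is an illustration under simple assumptions (square boxes, a purely horizontal offset, and the hypothetical helper name `shifted_iou`); it is not taken from the paper. For a square of side $s$ shifted by $e$ pixels, the IoU reduces to $(s-e)/(s+e)$, so the same pixel error costs a small box far more overlap than a large one.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def shifted_iou(side, err):
    """IoU between a square box and the same box shifted right by err pixels."""
    original = (0.0, 0.0, float(side), float(side))
    shifted = (float(err), 0.0, float(side + err), float(side))
    return iou(original, shifted)


# Same 5-pixel localization error, two different target sizes.
small = shifted_iou(20, 5)    # (20 - 5) / (20 + 5) = 0.6
large = shifted_iou(100, 5)   # (100 - 5) / (100 + 5), roughly 0.905
print(f"small box IoU: {small:.3f}, large box IoU: {large:.3f}")
```

A fixed IoU threshold therefore treats small and large targets unequally, which is the kind of inconsistency the size-related design in the paper is meant to counter.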