Temporally Propagated Masks and Bounding Boxes: Combining the Best of Both Worlds for Multi-Object Tracking

Tomasz Stanczyk, Francois Bremond

TL;DR

McByte is a novel approach that incorporates a temporally propagated segmentation mask as a strong association cue within a tracking-by-detection framework, enhancing robustness and generalizability without per-sequence tuning.

Abstract

Multi-object tracking (MOT) involves identifying and consistently tracking objects across video sequences. Traditional tracking-by-detection methods, while effective, often require extensive tuning and lack generalizability. Segmentation mask-based methods, on the other hand, are more generic but struggle with track management, which makes them unsuitable for MOT on their own. We propose a novel approach, McByte, which incorporates a temporally propagated segmentation mask as a strong association cue within a tracking-by-detection framework. By combining bounding box and propagated mask information, McByte enhances robustness and generalizability without per-sequence tuning. Evaluated on four benchmark datasets - DanceTrack, MOT17, SoccerNet-tracking 2022, and KITTI-tracking - McByte demonstrates a performance gain in all cases examined and outperforms existing mask-based methods. Implementation code will be provided upon acceptance.

Paper Structure

This paper contains 30 sections, 4 equations, 8 figures, and 13 tables.

Figures (8)

  • Figure 1: A temporally propagated mask can be helpful in cases of high occlusion. The person with the red mask is tracked using only the limited visible parts (indicated by yellow arrows for clarity). Input image data from mot17_ref. Best seen in color.
  • Figure 2: McByte tracking pipeline with mask cue guidance. The temporally propagated mask signal is incorporated as an association cue in the tracklet-detection association steps.
  • Figure 3: Visual output comparison between the baseline and McByte. With the temporally propagated mask guidance, McByte can handle longer occlusions in a crowd - see the subject with ID 54 in the output of McByte. Input image data from mot17_ref. Best seen in color.
  • Figure 4: Cases showing the differences in the $mm_{1}$ and $mm_{2}$ (\ref{sec:method_mask_use}) values of a temporally propagated mask (in blue) within a bounding box. The optimal case for the mask to provide good guidance is the second one from the left, where both $mm_{1}$ and $mm_{2}$ are as close to $1$ as possible.
  • Figure 5: An example of the difference in detection quality. Considerably more bounding boxes are missing in the case of public detections, which negatively impacts MOT performance. Input image data from mot17_ref, sequence MOT17-04, last frame (1050). Best seen in color.
  • ...and 3 more figures
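
The Figure 4 caption refers to two ratios, $mm_{1}$ and $mm_{2}$, that relate a propagated mask to a bounding box, but their definitions are not part of this listing. As a rough illustration only, the sketch below assumes two natural candidates: the fraction of the box that the mask fills, and the fraction of the mask that falls inside the box. The names `mask_box_ratios`, `box_fill`, and `mask_coverage` are hypothetical, and no claim is made about which ratio corresponds to $mm_{1}$ or $mm_{2}$ in the paper.

```python
from typing import List, Tuple

Mask = List[List[int]]            # binary mask as rows of 0/1 values
Box = Tuple[int, int, int, int]   # (x1, y1, x2, y2), with x2 and y2 exclusive


def mask_box_ratios(mask: Mask, box: Box) -> Tuple[float, float]:
    """Return (box_fill, mask_coverage) for a binary mask and a box.

    box_fill      = mask pixels inside the box / box area
    mask_coverage = mask pixels inside the box / all mask pixels
    Both definitions are assumptions made for illustration.
    """
    x1, y1, x2, y2 = box
    box_area = max(0, x2 - x1) * max(0, y2 - y1)
    total = sum(sum(row) for row in mask)
    # Count mask pixels that lie inside the box, clipped to the frame.
    inside = sum(
        mask[y][x]
        for y in range(max(0, y1), min(len(mask), y2))
        for x in range(max(0, x1), min(len(mask[y]), x2))
    )
    box_fill = inside / box_area if box_area else 0.0
    mask_coverage = inside / total if total else 0.0
    return box_fill, mask_coverage


# A 4x4 frame with a 2x2 mask in the top-left corner.
mask = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]

tight = mask_box_ratios(mask, (0, 0, 2, 2))    # box matches the mask exactly
loose = mask_box_ratios(mask, (0, 0, 4, 4))    # box much larger than the mask
shifted = mask_box_ratios(mask, (1, 0, 3, 2))  # box covers half of the mask
```

Under these assumed definitions, only the tight box yields both ratios equal to 1, which matches the caption's description of the case in which the mask gives the most reliable guidance. The loose box keeps full mask coverage but a low fill ratio, and the shifted box lowers both.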