Fast Self-Supervised depth and mask aware Association for Multi-Object Tracking

Milad Khanchi, Maria Amer, Charalambos Poullis

TL;DR

This work tackles the limitations of traditional IoU- and appearance-based MOT by introducing depth-aware, segmentation-guided association. It fuses zero-shot monocular depth with promptable segmentation to form depth-segmentation embeddings, which are refined by a self-supervised encoder and combined with motion and appearance cues in a total matching score $Match_t = S_{IoU_t} + S_{ang_t} + S_{sd_t} + S_{emb_t}$ for data association via the Hungarian algorithm. The approach, SelfTrEncMOT, demonstrates strong identity preservation in occluded and non-linear motion scenarios (DanceTrack, SportsMOT) while remaining competitive on linear-motion datasets (MOT17), and emphasizes the benefit of pixel-aligned geometric cues beyond 2D overlaps. A practical limitation is the depth estimation bottleneck (DepthPro), motivating future work on real-time depth estimators and contrastive objectives to enhance the encoder's discriminability and robustness.
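The total matching score and the assignment step can be illustrated with a minimal sketch. It assumes each of the four terms named in the score above is available as a tracks-by-detections similarity matrix; the numeric values, the function names `total_match` and `best_assignment`, and the equal weighting are illustrative assumptions, not taken from the paper. To stay dependency-free, an exhaustive search over permutations stands in for the Hungarian algorithm; it returns the same optimal assignment but is only practical for a handful of objects.

```python
from itertools import permutations


def total_match(s_iou, s_ang, s_sd, s_emb):
    """Element-wise sum of four (tracks x detections) similarity matrices."""
    return [
        [a + b + c + d for a, b, c, d in zip(r1, r2, r3, r4)]
        for r1, r2, r3, r4 in zip(s_iou, s_ang, s_sd, s_emb)
    ]


def best_assignment(match):
    """Return, for each track, the detection index maximizing the summed score.

    Exhaustive search used as a stand-in for the Hungarian algorithm.
    Assumes the number of tracks does not exceed the number of detections.
    """
    n_tracks, n_dets = len(match), len(match[0])
    best = max(
        permutations(range(n_dets), n_tracks),
        key=lambda perm: sum(match[t][d] for t, d in enumerate(perm)),
    )
    return list(best)


# Made-up scores for 2 tracks and 2 detections (illustrative values only).
s_iou = [[0.4, 0.6], [0.6, 0.4]]  # box overlap alone favors the swapped pairing
s_ang = [[0.1, 0.0], [0.0, 0.1]]
s_sd = [[0.8, 0.2], [0.3, 0.7]]
s_emb = [[0.9, 0.1], [0.2, 0.8]]

iou_only = best_assignment(s_iou)
combined = best_assignment(total_match(s_iou, s_ang, s_sd, s_emb))
print(iou_only)  # [1, 0]
print(combined)  # [0, 1]
```

In this toy case the IoU term alone would swap the two identities, while the summed score keeps them apart. For realistic numbers of tracks and detections, a polynomial-time linear assignment solver would replace the exhaustive search.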

Abstract

Multi-object tracking (MOT) methods often rely on Intersection-over-Union (IoU) for association. However, IoU becomes unreliable when objects are similar or occluded, and computing it for segmentation masks is computationally expensive. In this work, we use segmentation masks to capture object shapes, but we do not compute segmentation IoU. Instead, we fuse depth and mask features and pass them through a compact encoder trained in a self-supervised manner. This encoder produces stable object representations, which we use as an additional similarity cue alongside bounding box IoU and re-identification features for matching. We obtain depth maps from a zero-shot depth estimator and object masks from a promptable visual segmentation model, which together provide fine-grained spatial cues. Our MOT method is the first to use a self-supervised encoder to refine segmentation masks without computing mask IoU. MOT methods can be divided into joint detection-ReID (JDR) and tracking-by-detection (TBD) models; the latter are computationally more efficient. Experiments with our TBD method on challenging benchmarks with non-linear motion, occlusion, and crowded scenes, such as SportsMOT and DanceTrack, show that it outperforms the TBD state of the art on most metrics, while achieving competitive performance on simpler benchmarks with linear motion, such as MOT17.

Paper Structure

This paper contains 11 sections, 5 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: (a) Overview of SelfTrEncMOT. Given consecutive video frames and their object detector bounding boxes, we extract motion and appearance embeddings, and compute depth maps (via zero-shot monocular estimation) and segmentation masks (via Promptable Visual Segmentation). Depth and segmentation cues are fused into depth-segmentation embeddings and refined by a self-supervised encoder. The final association score integrates these embeddings with motion and appearance cues using a linear assignment solver. (b) Architecture of the depth-segmentation autoencoder. (c) Example of the encoder's input embedding.
  • Figure 2: Qualitative results of the depth-segmentation autoencoder. Top: input fused embeddings; Bottom: reconstructions. The encoder preserves key spatial details and object boundaries, supporting robust association, despite variations in scale and structure.