UniTrack: Differentiable Graph Representation Learning for Multi-Object Tracking

Bishoy Galoaa, Xiangyu Bai, Utsav Nandi, Sai Siddhartha Vivek Dhir Rangoju, Somaieh Amraee, Sarah Ostadabbas

TL;DR

UniTrack addresses persistent identity maintenance in multi-object tracking (MOT) with a differentiable graph-theoretic loss that jointly optimizes detection accuracy, identity preservation, and spatiotemporal coherence. It casts MOT as a sliding-window flow optimization over a graph with balance variables and flow conservation, and uses adaptive Laplacian-based weighting to balance the spatial and temporal terms. The loss can be plugged into existing MOT systems without architectural changes, and it delivers consistent improvements across TrackFormer, MOTR, FairMOT, ByteTrack, GTR, and MOTE on MOT17, MOT20, SportsMOT, and DanceTrack, including substantial reductions in ID switches and gains in IDF1/HOTA. The work provides theoretical convergence guarantees and analyzes frame-rate robustness. It reports a training-time memory overhead of about 5% and is currently limited to single-camera tracking, with multi-camera tracking left as future work.

Abstract

We present UniTrack, a plug-and-play graph-theoretic loss function designed to significantly enhance multi-object tracking (MOT) performance by directly optimizing tracking-specific objectives through unified differentiable learning. Unlike prior graph-based MOT methods that redesign tracking architectures, UniTrack provides a universal training objective that integrates detection accuracy, identity preservation, and spatiotemporal consistency into a single end-to-end trainable loss function, enabling seamless integration with existing MOT systems without architectural modifications. Through differentiable graph representation learning, UniTrack enables networks to learn holistic representations of motion continuity and identity relationships across frames. We validate UniTrack across diverse tracking models (TrackFormer, MOTR, FairMOT, ByteTrack, GTR, and MOTE) and multiple challenging benchmarks, demonstrating consistent improvements across all tested architectures and datasets. Extensive evaluations show up to a 53% reduction in identity switches and 12% IDF1 improvements across challenging benchmarks, with GTR achieving peak performance gains of 9.7% MOTA on SportsMOT.
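For readers less familiar with the metrics quoted above, the following is a minimal sketch of how IDF1 and a relative reduction in identity switches are conventionally computed. IDF1 is the harmonic mean of identity precision and identity recall, written in terms of identity true positives (IDTP), false positives (IDFP), and false negatives (IDFN). The function names and all counts below are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def idf1(idtp, idfp, idfn):
    """Standard IDF1: 2*IDTP / (2*IDTP + IDFP + IDFN)."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0


def relative_reduction(before, after):
    """Fractional drop from `before` to `after`, e.g. for ID-switch counts."""
    return (before - after) / before


# Hypothetical counts, NOT taken from the paper.
score = idf1(idtp=80, idfp=10, idfn=30)             # 160 / 200
drop = relative_reduction(before=100, after=47)     # 53 of 100 switches removed
```

With these made-up counts, `score` is 0.8 and `drop` is 0.53, which is what a "53% reduction in identity switches" would mean for a baseline with 100 switches.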

Paper Structure (19 sections, 2 theorems, 25 equations, 7 figures, 8 tables)

Key Result

Theorem 1

The UniTrack loss function $\mathcal{L} = \mathcal{L}_{\text{flow}} + \lambda_s\mathcal{L}_{\text{spatial}} + \lambda_t\mathcal{L}_{\text{temporal}}$ satisfies differentiability, local convergence under standard regularity conditions, and ensures physically plausible tracking solutions via flow cons
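As a rough illustration of the structure named in the theorem, the sketch below shows (a) one possible way to turn a flow conservation constraint into a differentiable penalty and (b) the weighted sum of the three loss terms. The squared-violation penalty, the graph encoding as dictionaries, the function names, and the numeric values of $\lambda_s$ and $\lambda_t$ are all assumptions made for this example; the paper's actual definition of $\mathcal{L}_{\text{flow}}$ and its adaptive weighting are not reproduced here.

```python
from collections import defaultdict


def conservation_penalty(flows, balance):
    """Sum of squared violations of (outflow - inflow == balance) per node.

    flows:   {(u, v): flow on edge u -> v}
    balance: {node: required net outflow}; nodes not listed default to 0.
    This squared penalty is an illustrative choice, not the paper's L_flow.
    """
    net = defaultdict(float)
    for (u, v), f in flows.items():
        net[u] += f  # flow leaving u
        net[v] -= f  # flow entering v
    nodes = set(net) | set(balance)
    return sum((net[n] - balance.get(n, 0.0)) ** 2 for n in nodes)


def total_loss(l_flow, l_spatial, l_temporal, lambda_s, lambda_t):
    """L = L_flow + lambda_s * L_spatial + lambda_t * L_temporal."""
    return l_flow + lambda_s * l_spatial + lambda_t * l_temporal


# One track: source s -> detection a (frame 1) -> detection b (frame 2) -> sink t.
balance = {"s": 1.0, "t": -1.0}
flows_ok = {("s", "a"): 1.0, ("a", "b"): 1.0, ("b", "t"): 1.0}
flows_leaky = {("s", "a"): 1.0, ("a", "b"): 0.5, ("b", "t"): 1.0}

ok = conservation_penalty(flows_ok, balance)        # conserved everywhere
leaky = conservation_penalty(flows_leaky, balance)  # violated at a and b

# Hypothetical term values and weights, for illustration only.
loss = total_loss(leaky, l_spatial=0.2, l_temporal=0.1, lambda_s=2.0, lambda_t=3.0)
```

A conserved path incurs zero penalty, while a link that carries only half a unit of flow between frames is penalized at both of its endpoints, which is the intuition behind using conservation to keep identities physically plausible.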

Figures (7)

  • Figure 1: Comparison of UniTrack's graph-based approach and classical multi-object tracking. (A) Detection-based tracking handles trajectories independently, resulting in ID switches at occlusion: person 1 is reassigned to ID 3 and person 2 to ID 1 (crossing arrows and red-highlighted boxes). (B) Our graph-based approach maintains correct identities through the same occlusion via three integrated components: temporal edges (red arrows) for motion consistency, spatial edges (green lines) for inter-object relationships, and flow components (blue dashed ellipses) for identity preservation. Green-highlighted ID boxes show successful identity maintenance throughout the sequence. Ground truth (blue circles) and predictions (dark centers) demonstrate how unified optimization prevents ID switches in challenging scenarios.
  • Figure 2: Illustration of tracking errors in MOTR [zeng2021motr] on MOT17 sequences 8 and 9. The first row (sequence 9) highlights post-occlusion ID switches (error Type 1): subject 1 loses tracking when occluded behind subject 4 in frame 1.B, with IDs 8, 3, and 1 subsequently reassigned as IDs 15, 14, and 12 in frame 1.C. The second row (sequence 8) demonstrates temporal inconsistency (error Type 2), where the tracker fails to maintain IDs when subjects change postures: ID 15 changes to 27 (2.B), and in 2.C, MOTR erroneously assigns two bounding boxes to subject 6, 13 while ID 14 changes to 10 and ID 15 is reassigned to 29, illustrating instability in temporal association. The third row (sequence 9) demonstrates cross-subject ID switches (error Type 3): IDs 22 and 3 are correctly assigned in 3.A, but when subject 42 occludes subject 3 in 3.B, it triggers a cascade of errors: ID 22 is incorrectly swapped with ID 80, and ID 3 is then erroneously reassigned as ID 88 in 3.C, showing how occlusions propagate tracking failures.
  • Figure 3: Comparative performance of MOTR [zeng2021motr] and UT-MOTR, revisiting the same challenging scenarios from Figure 2. Red bounding boxes indicate tracking errors, green boxes show successful tracking, and dotted red boxes highlight missed detections. UT-MOTR successfully addresses the three error types demonstrated in Figure 2: maintaining consistent IDs through occlusions (frames 1.B-1.C), preserving temporal consistency during posture changes (frames 2.B-2.C), and preventing cross-subject ID switches in crowded scenes. The unified graph-theoretic approach enables robust tracking where the baseline MOTR fails. Additional qualitative results with TrackFormer are provided in the paper's section on TrackFormer qualitative results.
  • Figure 4: Frame-rate resilience analysis of UniTrack. (A) HOTA scores show UT-GTR maintains superior performance, with both methods plateauing around 5-15 FPS. (B) Performance improvements of UT-GTR over GTR: HOTA gap decreases from 12% to 7% as frame rate increases, while MOTA and IDF1 gaps increase sharply from 1-5 FPS before converging. UniTrack maintains consistent advantages across all frame rates.
  • Figure 5: Loss surface evolution during training with and without UniTrack loss over the MOT17 dataset. Contour plots show normalized loss landscapes at iteration 800 (left) and 8000 (right) for models trained with UniTrack loss (top row, viridis colormap) and baseline training (bottom row, plasma colormap). The UniTrack loss creates broader, more stable convergence basins with smoother gradients, while the baseline approach results in narrower, more fragmented loss landscapes. Parameters $\alpha$ and $\beta$ represent perturbations around trained model weights. All surfaces are normalized to [0,1] for visual comparison.
  • ...and 2 more figures
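The Figure 5 caption describes loss surfaces obtained by perturbing trained weights along two parameters, $\alpha$ and $\beta$, and normalizing the result to [0,1]. A minimal sketch of that general recipe follows. The caption does not say how the perturbation directions or the grid were chosen, so the directions `d1` and `d2`, the grid `steps`, and the quadratic `toy_loss` are all stand-in assumptions rather than the authors' setup.

```python
def loss_surface(loss_fn, weights, d1, d2, alphas, betas):
    """Evaluate loss_fn at weights + alpha*d1 + beta*d2 over a 2D grid."""
    grid = []
    for a in alphas:
        row = []
        for b in betas:
            perturbed = [w + a * u + b * v for w, u, v in zip(weights, d1, d2)]
            row.append(loss_fn(perturbed))
        grid.append(row)
    return grid


def normalize01(grid):
    """Min-max normalize a 2D grid of loss values to the range [0, 1]."""
    flat = [x for row in grid for x in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    if span == 0:
        return [[0.0 for _ in row] for row in grid]
    return [[(x - lo) / span for x in row] for row in grid]


def toy_loss(w):
    """Hypothetical stand-in for a model's training loss (minimum at w = 1)."""
    return sum((x - 1.0) ** 2 for x in w)


trained = [1.0, 1.0]                 # pretend these are trained weights
d1, d2 = [1.0, 0.0], [0.0, 1.0]      # assumed perturbation directions
steps = [-1.0, -0.5, 0.0, 0.5, 1.0]  # assumed alpha/beta grid

surface = normalize01(loss_surface(toy_loss, trained, d1, d2, steps, steps))
```

The resulting `surface` can be passed to any contour-plotting routine. Because each surface is normalized independently, the comparison in such plots is about shape (breadth and smoothness of the basin), not about absolute loss values.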

Theorems & Definitions (2)

  • Theorem 1: Unified Convergence and Consistency
  • Theorem 2: Unified Differentiability and Local Convergence Properties