Joint Object Detection and Multi-Object Tracking with Graph Neural Networks

Yongxin Wang, Kris Kitani, Xinshuo Weng

TL;DR

This work tackles the limitation of separately optimizing detection and data association in MOT by introducing GSDT, a joint MOT framework that leverages Graph Neural Networks to model spatial-temporal relations between tracklets and detections. The approach uses a four-module architecture with a GNN-based relation module, multi-layer detection and embedding heads, and losses that are back-propagated across GNN layers, enabling end-to-end training. Empirical results on MOT15/16/17/20 show state-of-the-art performance in both detection and tracking metrics, with ablations highlighting the benefits and trade-offs of multi-layer GNNs. The work demonstrates that incorporating relational reasoning into the joint MOT pipeline improves both object detection quality and data association reliability, advancing online MOT capabilities.

Abstract

Object detection and data association are critical components in multi-object tracking (MOT) systems. Although the two components depend on each other, prior work often designs the detection and data association modules separately and trains them with separate objectives. As a result, gradients cannot be back-propagated through the entire MOT system, which leads to sub-optimal performance. To address this issue, recent work optimizes the detection and data association modules simultaneously under a joint MOT framework, which has improved performance in both modules. In this work, we propose a new instance of the joint MOT approach based on Graph Neural Networks (GNNs). The key idea is that GNNs can model relations between variable-sized objects in both the spatial and temporal domains, which is essential for learning discriminative features for detection and data association. Through extensive experiments on the MOT15/16/17/20 datasets, we demonstrate the effectiveness of our GNN-based joint MOT approach and show state-of-the-art performance for both detection and MOT tasks. Our code is available at: https://github.com/yongxinw/GSDT

Paper Structure

This paper contains 11 sections, 7 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: (Top left): Although it jointly trains detection and data association, prior work does not take object-object relations into account. (Top right): Prior work leverages object-object relations but uses them only for data association. Because it relies on off-the-shelf detections that cannot be optimized jointly, such a disjoint MOT paradigm can lead to sub-optimal MOT performance. (Bottom): Our method leverages spatial-temporal object relations for both detection and data association under a joint MOT framework.
  • Figure 2: (a) Overview of the Proposed Network. We first extract features $\hat{M}_{t-1}^0$, $\hat{M}_{t}^0$ from images $F_{t-1}$ and $F_t$ using a shared backbone. To obtain the feature of each tracklet in $T_{t-1}$, we use RoIAlign to crop features from the image feature $\hat{M}_{t-1}^0$ given the tracklets' boxes (red boxes in $\hat{M}_{t-1}^0$). To obtain the features of potential detections, we use the feature of every pixel in $\hat{M}_{t}^0$. To construct a graph with a manageable number of edges, we define edges between the features of potential detections and tracklets only if their spatial distance is within a window (grey boxes in $\hat{M}_{t}^0$). With the constructed graph, we use three GNN layers to update the features of tracklets and potential detections via node feature aggregation. A detection and data association head is applied to every GNN layer to obtain the final detections and matching. (b) Detection and Data Association: The location, box size, and refinement heads generate $\hat{M}^l_L$, $\hat{M}^l_S$, and $\hat{M}^l_R$, which are used to obtain detections. The embedding head generates $\hat{M}^l_E$ to compute the identity embedding for data association. (c) Node Feature Aggregation. The mixed color illustrates that features from tracklets and potential detections affect each other via relation modeling.
  • Figure 3: Visualization of our detection and tracking results on two sequences of the MOT17 test set.
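
The windowed graph construction and node feature aggregation described in the Figure 2 caption can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_window_edges` and `mean_aggregate` are hypothetical, tracklet positions are assumed to be integer box centers in feature-map coordinates, the window is assumed to be a square of odd side length, and plain mean aggregation stands in for the learned GNN layer.

```python
from typing import List, Tuple

Edge = Tuple[int, int]  # (tracklet index, flattened pixel index)


def build_window_edges(
    tracklet_centers: List[Tuple[int, int]],  # assumed (x, y) centers on the feature map
    height: int,
    width: int,
    window: int,  # assumed odd side length of the square window
) -> List[Edge]:
    """Connect each tracklet to the pixels (potential detections) inside its window."""
    half = window // 2
    edges: List[Edge] = []
    for t_idx, (cx, cy) in enumerate(tracklet_centers):
        # Clip the window to the feature-map borders.
        x0, x1 = max(0, cx - half), min(width - 1, cx + half)
        y0, y1 = max(0, cy - half), min(height - 1, cy + half)
        for y in range(y0, y1 + 1):
            for x in range(x0, x1 + 1):
                edges.append((t_idx, y * width + x))
    return edges


def mean_aggregate(
    tracklet_feats: List[List[float]],
    pixel_feats: List[List[float]],
    edges: List[Edge],
) -> Tuple[List[List[float]], List[List[float]]]:
    """One aggregation round: every node averages its own feature with its neighbors'.

    Assumption: an unweighted mean replaces the learned aggregation of a GNN layer.
    """
    dim = len(tracklet_feats[0])
    # Each running sum starts with the node's own feature (a self-loop).
    t_sum = [list(f) for f in tracklet_feats]
    p_sum = [list(f) for f in pixel_feats]
    t_cnt = [1] * len(tracklet_feats)
    p_cnt = [1] * len(pixel_feats)
    for t, p in edges:
        for d in range(dim):
            t_sum[t][d] += pixel_feats[p][d]      # detection -> tracklet message
            p_sum[p][d] += tracklet_feats[t][d]   # tracklet -> detection message
        t_cnt[t] += 1
        p_cnt[p] += 1
    new_t = [[v / c for v in s] for s, c in zip(t_sum, t_cnt)]
    new_p = [[v / c for v in s] for s, c in zip(p_sum, p_cnt)]
    return new_t, new_p


if __name__ == "__main__":
    # One tracklet in the top-left corner of a 4x4 feature map, 3x3 window.
    edges = build_window_edges([(0, 0)], height=4, width=4, window=3)
    print(edges)  # [(0, 0), (0, 1), (0, 4), (0, 5)]
    new_t, new_p = mean_aggregate([[1.0]], [[0.0]] * 16, edges)
    print(new_t[0], new_p[0], new_p[2])
```

Restricting edges to a window keeps the edge count proportional to the number of tracklets times the window area, rather than to the number of tracklets times the full feature-map size. Calling `mean_aggregate` repeatedly on its own outputs would mimic stacking several layers, each of which could feed its own detection and embedding heads.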