Multi-Target Tracking with Transferable Convolutional Neural Networks

Damian Owerko, Charilaos I. Kanatsoulis, Jennifer Bondarchuk, Donald J. Bucci, Alejandro Ribeiro

TL;DR

The paper addresses scalable multi-target tracking (MTT) by recasting the problem as image-to-image prediction through 2D target-intensity and measurement-intensity images. It introduces a fully convolutional encoder–decoder CNN trained on small tracking windows and demonstrates transfer to large areas, supported by a theoretical generalization bound. Empirically, the method yields a 29% improvement in OSPA when scaling from 1 km^2 to 25 km^2 and consistently outperforms random finite-set filters (GLMB and LMB) at all scales, while maintaining favorable computational properties. This work offers a scalable, structure-exploiting deep learning solution for MTT with potential applicability to other domains.
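As a rough illustration of the image representation described above, the sketch below renders a set of 2D points as an intensity image by summing one normalized Gaussian bump per point. The Gaussian kernel, the function name `rasterize`, and the parameters `width_m`, `resolution_m`, and `sigma_m` are assumptions made for this example and are not taken from the paper.

```python
import math

def rasterize(points, width_m, resolution_m, sigma_m):
    """Render 2D points as a sum of Gaussian bumps on a square grid.

    Hypothetical helper: the kernel shape and parameters are assumptions.
    Each bump integrates to 1, so the image integrates to roughly len(points).
    """
    n = int(round(width_m / resolution_m))      # pixels per side
    image = [[0.0] * n for _ in range(n)]
    norm = 1.0 / (2.0 * math.pi * sigma_m ** 2)  # 2D Gaussian normalization
    for row in range(n):
        y = (row + 0.5) * resolution_m           # pixel-center coordinates
        for col in range(n):
            x = (col + 0.5) * resolution_m
            for px, py in points:
                d2 = (x - px) ** 2 + (y - py) ** 2
                image[row][col] += norm * math.exp(-d2 / (2.0 * sigma_m ** 2))
    return image

# Two points in a 1 km window at 100 m resolution.
img = rasterize([(450.0, 450.0), (550.0, 650.0)], 1000.0, 100.0, 100.0)
approx_count = sum(sum(r) for r in img) * 100.0 ** 2  # close to 2
```

Under these assumptions, the sum of the pixel values times the pixel area approximates the number of points, which is one reason an intensity image is a convenient encoding for a set of unknown size.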

Abstract

Multi-target tracking (MTT) is a classical signal processing task, where the goal is to estimate the states of an unknown number of moving targets from noisy sensor measurements. In this paper, we revisit MTT from a deep learning perspective and propose a convolutional neural network (CNN) architecture to tackle it. We represent the target states and sensor measurements as images and recast the problem as an image-to-image prediction task. Then we train a fully convolutional model at small tracking areas and transfer it to much larger areas with numerous targets and sensors. This transfer learning approach enables MTT at a large scale and is also theoretically supported by our novel analysis that bounds the generalization error. In practice, the proposed transferable CNN architecture outperforms random finite set filters on the MTT task with 10 targets and transfers without re-training to a larger MTT task with 250 targets with a 29% performance improvement.
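The transfer from small to large areas relies on a fully convolutional model having no layer tied to a fixed input size. The minimal sketch below is not the paper's encoder-decoder architecture: it is a single-channel stack of "valid" convolutions with ReLU, using hypothetical names (`conv2d_valid`, `tiny_fcn`) and arbitrary hand-picked kernels, to show that the same filters run unchanged on a small and a large image.

```python
def conv2d_valid(image, kernel):
    """Single-channel 'valid' cross-correlation with stride 1."""
    k = len(kernel)
    rows = len(image) - k + 1
    cols = len(image[0]) - k + 1
    return [[sum(kernel[i][j] * image[r + i][c + j]
                 for i in range(k) for j in range(k))
             for c in range(cols)]
            for r in range(rows)]

def relu(image):
    """Elementwise max(0, v)."""
    return [[max(0.0, v) for v in row] for row in image]

def tiny_fcn(image, kernels):
    """Apply each kernel in turn, followed by a ReLU (illustrative only)."""
    out = image
    for kernel in kernels:
        out = relu(conv2d_valid(out, kernel))
    return out

# Arbitrary 3x3 kernels chosen for illustration, not learned weights.
kernels = [
    [[1.0 / 9.0] * 3 for _ in range(3)],
    [[0.0, 1.0, 0.0], [1.0, -2.0, 1.0], [0.0, 1.0, 0.0]],
]
small = [[float((r * c) % 5) for c in range(8)] for r in range(8)]
large = [[float((r * c) % 5) for c in range(20)] for r in range(20)]

out_small = tiny_fcn(small, kernels)   # 4 x 4 output
out_large = tiny_fcn(large, kernels)   # 16 x 16 output
```

Here `small` is the top-left crop of `large`, and `out_small` matches the top-left 4 x 4 block of `out_large`: each output pixel depends only on a local neighborhood of the input. With two 3 x 3 valid convolutions the output is 4 pixels narrower than the input, a simple instance of an output window that is smaller than the input window.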

Paper Structure

This paper contains 7 sections, 1 theorem, 9 equations, 3 figures.

Key Result

Theorem 1

Consider a CNN, defined in eq:cnn, with $L$ layers and a set of filter parameters $\hat{\mathcal{H}}$. Let $\mathcal{L}_\sqcap(\hat{\mathcal{H}})$ be the cost the CNN achieves on the windowed problem as defined by eq:mse_window with an input window of width $A$ and an output window width $B$. Under Assumptions assume

Figures (3)

  • Figure 1: Diagram of the proposed CNN architecture, shown here with a single encoder, hidden, and decoder layer. The filter parameters $\mathbf{H}_l \in \mathbb{R}^{N \times N \times F_l \times F_{l-1}}$ are represented by $\mathbf{H}_l[i,j] \in \mathbb{R}^{F_l \times F_{l-1}}$, which is indexed by $i,j \in \mathbb{Z}$.
  • Figure 2: The input and output of the CNN at window widths $w \in \{1, \dots, 5\}$ km. The first row shows the input sensor image, whereas the second row shows the corresponding output of the trained CNN.
  • Figure 3: Comparison of the OSPA of the three filters for different window sizes $w$. We report the mean over 100 simulations at each scale. Error bars indicate a 95% confidence interval.
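For readers unfamiliar with the metric in Figure 3, the sketch below computes the standard OSPA distance between two finite sets of points by brute force. The cutoff `c` and order `p` defaults are placeholder assumptions, since the values used in the paper are not given here, and the exhaustive search over assignments is only practical for a handful of points.

```python
import itertools
import math

def ospa(X, Y, c=100.0, p=2):
    """OSPA distance between two lists of points (brute-force assignment).

    c is the cutoff distance and p the order; both defaults are assumptions.
    """
    if len(X) > len(Y):
        X, Y = Y, X                      # ensure len(X) <= len(Y)
    m, n = len(X), len(Y)
    if n == 0:
        return 0.0                       # both sets are empty
    # Best assignment of the m points in X to distinct points in Y,
    # with each pairwise distance clipped at the cutoff c.
    best = min(
        sum(min(c, math.dist(x, Y[j])) ** p for x, j in zip(X, perm))
        for perm in itertools.permutations(range(n), m)
    )
    # Each of the n - m unmatched points costs the full cutoff c.
    return ((best + c ** p * (n - m)) / n) ** (1.0 / p)

truth = [(0.0, 0.0), (10.0, 0.0)]
estimate = [(10.0, 0.0), (0.0, 0.0)]
same_set_distance = ospa(truth, estimate)  # ordering does not matter
```

For larger sets, the inner minimization is usually solved with an assignment solver such as `scipy.optimize.linear_sum_assignment` rather than by enumerating permutations.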

Theorems & Definitions (1)

  • Theorem 1