No Identity, no problem: Motion through detection for people tracking

Martin Engilberge, F. Wilke Grosche, Pascal Fua

TL;DR

This paper proposes learning motion cues for people tracking while supervising only the detections, which are much cheaper to annotate than identities. The approach delivers state-of-the-art results for single- and multi-view multi-target tracking on the MOT17 and WILDTRACK datasets.

Abstract

Tracking-by-detection has become the de facto standard approach to people tracking. To increase robustness, some approaches incorporate re-identification using appearance models and regress motion offsets, which requires costly identity annotations. In this paper, we propose exploiting motion cues while providing supervision only for the detections, which is much easier to do. Our algorithm predicts detection heatmaps at two different times, along with a 2D motion estimate between the two images. It then warps one heatmap using the motion estimate and enforces consistency with the other one. This provides the required supervisory signal on the motion without the need for any motion annotations. In this manner, we couple the information obtained from different images during training and increase accuracy, especially in crowded scenes and when using low frame-rate sequences. We show that our approach delivers state-of-the-art results for single- and multi-view multi-target tracking on the MOT17 and WILDTRACK datasets.

Paper Structure

This paper contains 59 sections, 7 equations, 15 figures, 12 tables.

Figures (15)

  • Figure 1: Predicting human motion. Left: We use muSSP Wang19f to link detections at different frame rates. We plot the MOTA and IDF1 metrics as a function of the frame rate. Below 3 FPS, the degradation becomes severe. Right: Our model estimates a detection heatmap at time $t$ and predicts the motion of objects between $t$ and $t+1$. The offsets are used to warp the heatmap into a prediction at time $t+1$, and we enforce consistency between that prediction and the one estimated by the network using the image acquired at time $t+1$.
  • Figure 2: Details of the proposed differentiable reconstruction from motion. Given a detection map at time $t$ and an offset map capturing the motion of objects between time $t$ and time $t+1$, we reconstruct the detection map at time $t+1$. Each reconstructed pixel is a weighted sum of the detections of the previous time step. The weights are derived from the distance between the reconstructed location and the expected position of each previous location after it has been moved by its offset. The example above illustrates the reconstruction for three locations. The bottom one is mainly unaffected by the offset, so its distance weight is a disc that decreases as the distance to the reconstructed location increases. For the middle reconstruction, the offset shows that the object has moved away from that location, so the corresponding weight for that location is small. For the top reconstruction, the offset arrives at that location, so the contribution from the starting point of the offset is high.
  • Figure 2: Tracking performance with low frame rate on MOT17 validation. We also evaluate our approach in a monocular setting. We use the MOT17 dataset and modify YOLOX and ByteTrack to predict and use motion.
  • Figure 3: Reconstruction weight function. During the reconstruction from detection and offset, the contributions of previous detections are reweighted using the function plotted above. With a value of $\lambda_r=0.8$, a location 2 pixels away from the reconstructed location (distance accounting for motion offset) has a weight of 1 and is fully added to the reconstruction at that location. By varying $\lambda_r$, we control the trade-off between reconstruction accuracy and differentiability. Note that when $\lambda_r=5$, the weight equals one only when the pixel distance is smaller than 0.5 (represented by the dashed black line). During reconstruction, this means that a detection in the previous time step contributes to only a single location in the next one.
  • Figure 4: Network Architectures. Our approach to motion prediction is flexible and usable in conjunction with various detectors/trackers. Trainable components have a blue background, while static modules have a white background. Top Left: Single-view setup where the detector's output serves as the primary training signal for the motion predictor. Bottom Left: Multi-view setup with simultaneous training for detection and motion. A ResNet feature extractor produces feature maps $F$. These are projected onto a common ground plane via view homographies $\mathbf{H}_c$, resulting in $G$. These are combined and passed to a scene aggregator that produces an initial detection heatmap $\mathbf{X}^{t}$ and a motion offset map $\Delta^{t,t+1}$, which a differentiable reconstruction module uses to compute the next frame's detection map $\hat{\mathbf{X}}^{t+1}$. Right: The predicted motion can be leveraged by trackers to predict higher quality trajectories.
  • ...and 10 more figures