Deep Learning-Based Robust Multi-Object Tracking via Fusion of mmWave Radar and Camera Sensors

Lei Cheng, Arindam Sengupta, Siyang Cao

TL;DR

This work tackles robust multi-object tracking (MOT) for autonomous driving by fusing mmWave radar and camera data in a deep learning-based framework. It combines a FaceNet-inspired appearance embedding with a Bi-LSTM motion model and a tri-output design that produces radar-only, camera-only, and fused tracking results, improving robustness to sensor failures and low-visibility conditions. Key contributions include a discriminative 128-d appearance embedding trained with triplet loss (pre-trained on CASIA-WebFace and fine-tuned on a custom dataset, reaching 98.65% accuracy at a threshold of 1.09), a Bi-LSTM-based motion predictor, and a dual-cue association step that uses Hungarian matching to combine appearance and position cues. Evaluations on controlled parking-lot data and the nuScenes dataset, using the metrics $MOTA$, $MOTP$, $AMOTA$, and $AMOTP$, show improved MOTP and competitive MOTA with real-time performance (~14 FPS), demonstrating practical applicability and resilience in adverse sensing conditions.
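
The dual-cue association mentioned above can be illustrated with a small sketch. It is not the authors' implementation: the cost weighting (`w_app`), the position scale (`pos_scale`), and the dictionary layout of tracks and detections are assumptions made for illustration. For brevity, an exhaustive search over permutations stands in for the Hungarian algorithm; it returns the same minimum-cost assignment but only scales to a handful of objects, and it assumes equally many tracks and detections.

```python
import math
from itertools import permutations

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cost_matrix(tracks, detections, w_app=0.5, pos_scale=10.0):
    """cost[i][j] mixes appearance distance and scaled position distance.

    w_app and pos_scale are hypothetical tuning values, not from the paper.
    """
    return [
        [
            w_app * l2(t["emb"], d["emb"])
            + (1.0 - w_app) * l2(t["pos"], d["pos"]) / pos_scale
            for d in detections
        ]
        for t in tracks
    ]

def assign(cost):
    """Minimum-cost one-to-one assignment for a square cost matrix.

    Exhaustive search; a Hungarian solver finds the same optimum faster.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(cost[i][p[i]] for i in range(n)),
    )
    return list(enumerate(best))

# Toy data: 2-d embeddings stand in for the 128-d appearance vectors.
tracks = [
    {"emb": [1.0, 0.0], "pos": [0.0, 0.0]},
    {"emb": [0.0, 1.0], "pos": [5.0, 5.0]},
]
detections = [
    {"emb": [0.1, 0.9], "pos": [5.5, 5.0]},
    {"emb": [0.9, 0.1], "pos": [0.5, 0.0]},
]

matches = assign(cost_matrix(tracks, detections))
print(matches)  # (track index, detection index) pairs
```

Here track 0 pairs with detection 1 and track 1 with detection 0, because both the appearance and the position cues agree on that pairing. When the two cues disagree, the weighting decides which one dominates.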

Abstract

Autonomous driving holds great promise in addressing traffic safety concerns by leveraging artificial intelligence and sensor technology. Multi-Object Tracking plays a critical role in ensuring safer and more efficient navigation through complex traffic scenarios. This paper presents a novel deep learning-based method that integrates radar and camera data to enhance the accuracy and robustness of Multi-Object Tracking in autonomous driving systems. The proposed method leverages a Bi-directional Long Short-Term Memory network to incorporate long-term temporal information and improve motion prediction. An appearance feature model inspired by FaceNet is used to establish associations between objects across different frames, ensuring consistent tracking. A tri-output mechanism is employed, consisting of individual outputs for radar and camera sensors and a fusion output, to provide robustness against sensor failures and produce accurate tracking results. Through extensive evaluations of real-world datasets, our approach demonstrates remarkable improvements in tracking accuracy, ensuring reliable performance even in low-visibility scenarios.
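
A FaceNet-style appearance model is trained with a triplet loss and then used by thresholding the distance between two embeddings. The sketch below shows both steps under stated assumptions: the margin of 0.2 is an illustrative value, the 2-d vectors stand in for real embeddings, and applying the summary's reported threshold of 1.09 to the squared L2 distance of normalized embeddings is a guess at the convention, not something the text confirms.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length, as FaceNet-style embeddings are."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    The margin value is illustrative, not taken from the paper.
    """
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)

def same_object(emb_a, emb_b, threshold=1.09):
    """Declare a match when the embedding distance is below the threshold.

    Assumes the threshold applies to the squared L2 distance.
    """
    return sq_dist(emb_a, emb_b) < threshold

# Toy embeddings: anchor and positive point the same way, negative does not.
anchor = l2_normalize([1.0, 0.1])
positive = l2_normalize([0.9, 0.2])
negative = l2_normalize([-0.2, 1.0])

print(triplet_loss(anchor, positive, negative))  # well-separated triplet
print(triplet_loss(anchor, negative, positive))  # violating triplet
print(same_object(anchor, positive), same_object(anchor, negative))
```

The first triplet already satisfies the margin and contributes zero loss, while the swapped triplet is penalized. Training pushes embeddings toward the first situation, which is what makes a single distance threshold usable for associating objects across frames.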

Paper Structure (22 sections, 18 equations, 12 figures, 5 tables)

Figures (12)

  • Figure 1: The general framework of the proposed MOT method. Only the sensor fusion tracker is shown here; the radar and camera trackers follow the same processing steps.
  • Figure 2: The network architecture of the appearance model.
  • Figure 3: The distance between embeddings of the same object (top) and between embeddings of different objects (bottom).
  • Figure 4: The network architecture of the motion model. The image data branch network is only for the image motion prediction model.
  • Figure 5: Prediction results of the image model on the testing dataset. Values are in pixels.
  • ...and 7 more figures