Deep Learning-Based Robust Multi-Object Tracking via Fusion of mmWave Radar and Camera Sensors

Lei Cheng; Arindam Sengupta; Siyang Cao

TL;DR

This work tackles robust multi-object tracking (MOT) for autonomous driving by fusing radar and camera data in a deep learning-based framework. It combines a FaceNet-inspired appearance embedding with a Bi-LSTM motion model and a tri-output design that runs a radar tracker, a camera tracker, and a fused tracker side by side, improving robustness to sensor failures and low-visibility conditions. Key contributions include a discriminative 128-d appearance embedding trained with triplet loss (pre-trained on CASIA-WebFace and fine-tuned on a custom dataset, reaching 98.65% accuracy at a threshold of 1.09), a Bi-LSTM-based motion predictor, and a dual-cue association step that fuses appearance and position cues via Hungarian matching. Evaluations on controlled parking-lot data and the nuScenes dataset, using MOTA, MOTP, AMOTA, and AMOTP, show improved MOTP and competitive MOTA at real-time speed (~14 FPS), demonstrating practical applicability and resilience in adverse sensing conditions.
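The dual-cue association can be illustrated with a small sketch. It is not the authors' implementation: the weight `w_app`, the normalizer `pos_scale`, the gate `max_cost`, and the 2-d toy embeddings (the paper uses 128-d) are all assumptions made for illustration. To stay dependency-free, the sketch finds the minimum-cost assignment by exhaustive search, which returns the same optimum as the Hungarian algorithm on small inputs; in practice a solver such as `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
import math
from itertools import permutations


def cost_matrix(track_emb, track_pos, det_emb, det_pos, w_app=0.5, pos_scale=10.0):
    """Blend appearance and position distances into one cost per (track, detection).

    w_app and pos_scale are hypothetical tuning knobs, not values from the paper.
    """
    costs = []
    for t_emb, t_pos in zip(track_emb, track_pos):
        row = []
        for d_emb, d_pos in zip(det_emb, det_pos):
            appearance = math.dist(t_emb, d_emb)           # embedding distance
            position = math.dist(t_pos, d_pos) / pos_scale  # scaled spatial distance
            row.append(w_app * appearance + (1.0 - w_app) * position)
        costs.append(row)
    return costs


def assign(costs, max_cost=1.0):
    """Minimum-cost one-to-one assignment by exhaustive search (small inputs only).

    Pairs whose cost exceeds max_cost (an assumed gate) are dropped afterwards.
    """
    n_tracks = len(costs)
    n_dets = len(costs[0]) if costs else 0
    if n_tracks == 0 or n_dets == 0:
        return []
    if n_tracks <= n_dets:
        candidates = ([(i, p[i]) for i in range(n_tracks)]
                      for p in permutations(range(n_dets), n_tracks))
    else:
        candidates = ([(p[j], j) for j in range(n_dets)]
                      for p in permutations(range(n_tracks), n_dets))
    best = min(candidates, key=lambda pairs: sum(costs[i][j] for i, j in pairs))
    return sorted((i, j) for i, j in best if costs[i][j] <= max_cost)


# Toy scene: two existing tracks, three new detections.
track_emb = [(1.0, 0.0), (0.0, 1.0)]
track_pos = [(0.0, 0.0), (10.0, 0.0)]
det_emb = [(0.0, 1.0), (1.0, 0.0), (0.6, 0.8)]
det_pos = [(10.5, 0.0), (0.5, 0.0), (30.0, 30.0)]

costs = cost_matrix(track_emb, track_pos, det_emb, det_pos)
matches = assign(costs)
matched_dets = {d for _, d in matches}
unmatched_dets = [j for j in range(len(det_emb)) if j not in matched_dets]
print(matches, unmatched_dets)
```

Here track 0 pairs with detection 1 and track 1 with detection 0, since both cues agree; detection 2 stays unmatched and would typically seed a new track under the tracker's track-management rules.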

Abstract

Autonomous driving holds great promise in addressing traffic safety concerns by leveraging artificial intelligence and sensor technology. Multi-Object Tracking plays a critical role in ensuring safer and more efficient navigation through complex traffic scenarios. This paper presents a novel deep learning-based method that integrates radar and camera data to enhance the accuracy and robustness of Multi-Object Tracking in autonomous driving systems. The proposed method leverages a Bi-directional Long Short-Term Memory network to incorporate long-term temporal information and improve motion prediction. An appearance feature model inspired by FaceNet is used to establish associations between objects across different frames, ensuring consistent tracking. A tri-output mechanism is employed, consisting of individual outputs for radar and camera sensors and a fusion output, to provide robustness against sensor failures and produce accurate tracking results. Through extensive evaluations of real-world datasets, our approach demonstrates remarkable improvements in tracking accuracy, ensuring reliable performance even in low-visibility scenarios.
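As a rough illustration of the FaceNet-style appearance model, the sketch below shows a triplet loss and a threshold test on embedding distance. Several details are assumptions rather than facts from this paper: the squared L2 distance on unit-normalized embeddings and the margin of 0.2 follow the original FaceNet formulation, the 3-d vectors stand in for the 128-d embeddings, and applying the reported threshold of 1.09 to the squared distance is a guess at how it is used.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length, as FaceNet does with its embeddings."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge on d(anchor, positive) - d(anchor, negative) + margin.

    margin=0.2 is taken from FaceNet, not from this paper.
    """
    return max(sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin, 0.0)


def same_object(emb_a, emb_b, threshold=1.09):
    """Declare two detections the same object if their embeddings are close.

    Using 1.09 on the squared distance is an assumption about the paper's threshold.
    """
    return sq_dist(emb_a, emb_b) < threshold


# Toy embeddings: positive is near the anchor, negative is orthogonal to it.
anchor = l2_normalize([1.0, 0.0, 0.0])
positive = l2_normalize([0.9, 0.1, 0.0])
negative = l2_normalize([0.0, 1.0, 0.0])

easy = triplet_loss(anchor, positive, negative)  # already separated by the margin
hard = triplet_loss(anchor, negative, positive)  # roles swapped, large penalty
print(easy, hard, same_object(anchor, positive), same_object(anchor, negative))
```

The loss is zero once the positive is closer to the anchor than the negative by at least the margin, which is what makes a single distance threshold usable for cross-frame association afterwards.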

Paper Structure (22 sections, 18 equations, 12 figures, 5 tables)

This paper contains 22 sections, 18 equations, 12 figures, 5 tables.

Introduction
Related Works
Preliminaries
Multi-Object Tracking
Similarity computation
Association strategy
Track management
Radar-Camera Sensor Fusion
Radar-Camera Co-Calibration
Sensor Fusion Schemes
FaceNet and Bi-LSTM
FaceNet
Bi-LSTM
Methodology
Appearance Model based on FaceNet
...and 7 more sections

Figures (12)

Figure 1: The general framework of the proposed MOT method. Only the sensor fusion tracker is shown here; the radar and camera trackers follow the same processing steps.
Figure 2: The network architecture of the appearance model.
Figure 3: The distance between the same object (above) and the distance between different objects (below).
Figure 4: The network architecture of the motion model. The image data branch network is only for the image motion prediction model.
Figure 5: The prediction result on the testing dataset for the image model. Values are in pixels.
...and 7 more figures
