PNAS-MOT: Multi-Modal Object Tracking with Pareto Neural Architecture Search

Chensheng Peng, Zhaoyu Zeng, Jinling Gao, Jundong Zhou, Masayoshi Tomizuka, Xinbing Wang, Chenghu Zhou, Nanyang Ye

TL;DR

The paper addresses real-time multi-modal MOT for autonomous driving by introducing a latency-constrained neural architecture search framework that operates on a Pareto frontier between accuracy and latency. It adopts a two-stage NAS process: Stage I finds a backbone under latency constraints, and Stage II retrains with pruning to maximize accuracy, yielding a final model ζ*=(α*,θ*). A multi-modal fusion module combines image and LiDAR features with weighted fusion to improve robustness when one sensor underperforms. Experiments on KITTI show 89.59% MOTA with latency under 80 ms on edge devices, and latency analyses across devices demonstrate practical edge deployment, highlighting the method’s potential for efficient autonomous driving.
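The trade-off described above can be illustrated with a minimal sketch of selecting from a Pareto frontier under a latency budget. This is not the paper's search procedure: the `Candidate` type, the function names, and all numbers below are hypothetical and chosen only to show how dominated architectures are discarded before the most accurate feasible one is picked.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str          # hypothetical architecture label
    mota: float        # tracking accuracy, higher is better
    latency_ms: float  # inference latency, lower is better


def pareto_front(cands: List[Candidate]) -> List[Candidate]:
    """Return the candidates that no other candidate dominates."""
    front = []
    for c in cands:
        dominated = any(
            o.mota >= c.mota
            and o.latency_ms <= c.latency_ms
            and (o.mota > c.mota or o.latency_ms < c.latency_ms)
            for o in cands
        )
        if not dominated:
            front.append(c)
    return sorted(front, key=lambda c: c.latency_ms)


def best_under_budget(cands: List[Candidate], budget_ms: float) -> Optional[Candidate]:
    """Pick the most accurate Pareto-optimal candidate within the latency budget."""
    feasible = [c for c in pareto_front(cands) if c.latency_ms <= budget_ms]
    return max(feasible, key=lambda c: c.mota) if feasible else None


# Illustrative values only; these are NOT results from the paper.
candidates = [
    Candidate("arch_a", 85.0, 40.0),
    Candidate("arch_b", 88.0, 70.0),
    Candidate("arch_c", 87.0, 90.0),   # dominated by arch_b
    Candidate("arch_d", 90.0, 120.0),
]

front_names = [c.name for c in pareto_front(candidates)]
chosen = best_under_budget(candidates, budget_ms=80.0)
```

With an 80 ms budget, `arch_d` is excluded as too slow and `arch_c` is excluded because `arch_b` is both faster and more accurate, so the sketch selects `arch_b`.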

Abstract

Multiple object tracking is a critical task in autonomous driving. Existing works primarily focus on the heuristic design of neural networks to obtain high accuracy. As tracking accuracy improves, however, neural networks become increasingly complex, and their high latency poses challenges for practical application in real driving scenarios. In this paper, we explore the use of neural architecture search (NAS) methods to search for efficient architectures for tracking, aiming for low real-time latency while maintaining relatively high accuracy. Another challenge for object tracking is the unreliability of a single sensor; we therefore propose a multi-modal framework to improve robustness. Experiments demonstrate that our algorithm can run on edge devices under lower latency constraints, greatly reducing the computational requirements for multi-modal object tracking.

Paper Structure (18 sections, 6 equations, 5 figures, 4 tables, 2 algorithms)

Figures (5)

  • Figure 1: Overview of the tracking framework. We take image patches and LiDAR point clouds as input, each processed by its corresponding encoder. A fusion module then fuses the multi-modal features. The fused features are split into tracklet features and detection features, from which the correlation features are calculated. These features are fed to the subsequent adjacent estimator to infer confidence and affinity scores. Finally, linear programming is used to update the new tracklets from the inferred scores.
  • Figure 2: Successful and failed examples of multi-object tracking. Green bounding boxes: the assigned IDs 0, 227, and 229 are consistent across the two consecutive frames $t-1$ and $t$, indicating successful tracking. Red bounding box: ID 220 is wrongly assigned to the black vehicle in frame $t$, a tracking failure. Under the tracking-by-detection setting, multi-object tracking can be regarded as a downstream task of detection. Tracking dynamic objects under occlusion is hard, yet occlusion is common in real driving scenarios and handling it is significant for safe autonomous driving.
  • Figure 3: The searching and training process is divided into two stages. We obtain the optimal network structures in the first stage through constrained NAS and get the optimal network parameters in the second training stage.
  • Figure 4: Comparison with other methods on the KITTI testing dataset. The x-axis represents the reciprocal of the latency. As shown in the figure, our method sacrifices little MOTA while achieving outstanding latency, less than half that of mmMOT.
  • Figure 5: Tracking results of our multi-object tracking model on different sequences from the testing dataset of the KITTI benchmark. The sequences include some extreme conditions in highly complex environments, yet our model accurately keeps track of every object despite interference from illumination, shadows, and occlusions. This suggests that the model is accurate and robust to complex background environments.
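
The fusion module in Figure 1 combines image and LiDAR features with weighted fusion. A minimal sketch of one way such a weighting could behave is given below. It is an assumption for illustration, not the paper's module: the per-modality scores, the softmax normalisation, and the function names are all hypothetical, and in a trained network the weights would presumably be learned rather than passed in.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_fusion(
    image_feat: List[float],
    lidar_feat: List[float],
    image_score: float,  # hypothetical reliability score for the camera branch
    lidar_score: float,  # hypothetical reliability score for the LiDAR branch
) -> List[float]:
    """Fuse two equally sized feature vectors as a convex combination."""
    if len(image_feat) != len(lidar_feat):
        raise ValueError("feature vectors must have the same length")
    w_img, w_lidar = softmax([image_score, lidar_score])
    return [w_img * a + w_lidar * b for a, b in zip(image_feat, lidar_feat)]


# Equal scores: both modalities contribute equally.
balanced = weighted_fusion([1.0, 0.0], [0.0, 1.0], 0.0, 0.0)

# Low LiDAR score: the fused feature leans almost entirely on the image branch.
image_heavy = weighted_fusion([1.0, 0.0], [0.0, 1.0], 10.0, -10.0)
```

Because the weights sum to one, a low score for one sensor shifts the fused feature toward the other, which is the robustness property the summary attributes to the fusion module.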