Lightweight LiDAR-Camera 3D Dynamic Object Detection and Multi-Class Trajectory Prediction

Yushen He, Lei Zhao, Tianchen Deng, Zipeng Fang, Weidong Chen

TL;DR

This work presents a lightweight, modular LiDAR-camera framework for real-time 3D dynamic object detection and multi-class trajectory prediction on resource-constrained mobile robots. The detection module, CMDT, uses 3D Multi-Modal Deformable Attention and targeted training strategies to improve cross-modal accuracy at lower computational cost. The RTMCT predictor uses learnable reference trajectories and a two-layer transformer to produce diverse, class-aware future paths without heavy generative models. The system outperforms existing methods on the CODa benchmark and runs in real time on a wheelchair robot (13.2 fps), and extensive ablations validate the design choices. The approach offers a practical path toward deployable, multi-class dynamic perception for daily-life mobile robotics, and the implementations are open source.

Abstract

Service mobile robots are often required to avoid dynamic objects while performing their tasks, but they usually have only limited computational resources. We therefore present a lightweight multi-modal framework for 3D object detection and trajectory prediction. Our system integrates LiDAR and camera inputs to achieve real-time perception of pedestrians, vehicles, and riders in 3D space. The framework introduces two novel modules: 1) a Cross-Modal Deformable Transformer (CMDT) for object detection with high accuracy and an acceptable amount of computation, and 2) a Reference Trajectory-based Multi-Class Transformer (RTMCT) for efficient and diverse trajectory prediction of multi-class objects with flexible trajectory lengths. Evaluations on the CODa benchmark demonstrate superior performance over existing methods in both detection (+2.03% in mAP) and trajectory prediction (-0.408 m in minADE5 for pedestrians). The system is also readily deployable: when implemented on a wheelchair robot with an entry-level NVIDIA 3060 GPU, it achieves real-time inference at 13.2 fps. To facilitate reproducibility and practical deployment, we release the code of the method at https://github.com/TossherO/3D_Perception and its ROS inference version at https://github.com/TossherO/ros_packages.
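
For readers unfamiliar with the minADE5 metric quoted above, the sketch below follows the usual definition of minADE-K: the average displacement error (ADE) is the mean Euclidean distance between a predicted and a ground-truth trajectory over all time steps, and minADE-K is the smallest ADE among K candidate predictions. The function names and the toy trajectories are hypothetical. The paper's exact evaluation protocol (prediction horizon, sampling rate, how the K candidates are chosen) is not stated in this section and is not assumed here.

```python
import math


def ade(pred, gt):
    """Average displacement error: mean Euclidean distance over time steps."""
    if len(pred) != len(gt) or not gt:
        raise ValueError("trajectories must be non-empty and of equal length")
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)


def min_ade_k(candidates, gt, k=5):
    """Smallest ADE among the first k candidate trajectories."""
    if not candidates:
        raise ValueError("at least one candidate trajectory is required")
    return min(ade(c, gt) for c in candidates[:k])


# Hypothetical toy data: one ground-truth path and three candidate predictions,
# each a list of (x, y) positions in metres.
gt = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
candidates = [
    [(1.0, 1.0), (2.0, 1.0), (3.0, 1.0)],  # offset by 1 m at every step
    [(1.0, 0.0), (2.0, 0.0), (3.0, 0.5)],  # exact except for the last step
    [(0.0, 0.0), (0.0, 0.0), (0.0, 0.0)],  # stays at the origin
]

print(f"minADE5 = {min_ade_k(candidates, gt, k=5):.3f} m")
```

Because only the best of the K candidates counts, the metric rewards a predictor for covering the true future with at least one plausible path, which is why it is commonly reported for multi-modal predictors.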

Paper Structure

This paper contains 26 sections, 7 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Detection and trajectory prediction results of our method on CODa.
  • Figure 2: Architecture of CMDT.
  • Figure 3: Architecture of MMDA module.
  • Figure 4: Architecture of RTMCT.
  • Figure 5: Mobile robot platform.
  • ...and 2 more figures