Towards Robust Perception for Assistive Robotics: An RGB-Event-LiDAR Dataset and Multi-Modal Detection Pipeline

Adam Scicluna, Cedric Le Gentil, Sheila Sutjipto, Gavin Paul

TL;DR

This work addresses perception for assistive robotics by proposing a robust, multimodal dataset and a pipeline that fuses RGB or event camera data with LiDAR for 3D object localization and tracking. The dataset comprises RGB, event, LiDAR, and IMU streams plus ground-truth poses for two dynamic agents, collected indoors with a handheld rig and motion capture, enabling detailed benchmarking. The proposed pipeline performs 2D detection (RGB via YOLOv4 or events via RVT), 2D tracking (SORT), LiDAR-based 3D point filtering, and 3D tracking with a CVKF whose state is $x_{3D} = [x, y, z, \, \dot{x}, \, \dot{y}, \, \dot{z}]^\top$. Experiments show that frame-based detectors generalize better in this setting, while state-of-the-art event-based models remain largely limited to automotive-like contexts. The dataset is released to spur safer and more robust algorithms for assistive robotics, including intelligent disobedience.
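
For readers unfamiliar with the 3D tracking step, the following is a minimal sketch of what a filter over the state $x_{3D}$ could look like, reading CVKF as a constant-velocity Kalman filter, which the position-velocity state suggests. The time step $\Delta t$, the position measurement $z_k$ (assumed here to come from the filtered LiDAR points), and the noise covariances $Q$ and $R$ are illustrative assumptions; the summary does not state the values or the exact measurement model used in the paper.

$$
F = \begin{bmatrix} I_3 & \Delta t \, I_3 \\ 0_3 & I_3 \end{bmatrix}, \qquad H = \begin{bmatrix} I_3 & 0_3 \end{bmatrix}
$$

Prediction propagates the position with the current velocity estimate:

$$
\hat{x}_{k|k-1} = F \, \hat{x}_{k-1|k-1}, \qquad P_{k|k-1} = F \, P_{k-1|k-1} \, F^\top + Q
$$

The update corrects the prediction with the measured 3D position:

$$
K_k = P_{k|k-1} H^\top \left( H P_{k|k-1} H^\top + R \right)^{-1}
$$

$$
\hat{x}_{k|k} = \hat{x}_{k|k-1} + K_k \left( z_k - H \hat{x}_{k|k-1} \right), \qquad P_{k|k} = \left( I_6 - K_k H \right) P_{k|k-1}
$$

Under this model the velocity components $\dot{x}, \dot{y}, \dot{z}$ are never measured directly; they are inferred from successive position measurements through the $\Delta t \, I_3$ coupling in $F$.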

Abstract

The increasing adoption of human-robot interaction presents opportunities for technology to positively impact lives, particularly those with visual impairments, through applications such as guide-dog-like assistive robotics. We present a pipeline exploring the perception and "intelligent disobedience" required by such a system. A dataset of two people moving in and out of view has been prepared to compare RGB-based and event-based multi-modal dynamic object detection using LiDAR data for 3D position localisation. Our analysis highlights challenges in accurate 3D localisation using 2D image-LiDAR fusion, indicating the need for further refinement. Compared to the performance of the frame-based detection algorithm utilised (YOLOv4), current cutting-edge event-based detection models appear limited to contextual scenarios, such as for automotive platforms. This is highlighted by weak precision and recall over varying confidence and Intersection over Union (IoU) thresholds when using frame-based detections as a ground truth. Therefore, we have publicly released this dataset to the community, containing RGB, event, point cloud and Inertial Measurement Unit (IMU) data along with ground truth poses for the two people in the scene to fill a gap in the current landscape of publicly available datasets and provide a means to assist in the development of safer and more robust algorithms in the future: https://uts-ri.github.io/revel/.
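
The evaluation described above, precision and recall over varying confidence and IoU thresholds with frame-based detections as the reference, can be sketched as follows. This is an illustrative sketch rather than the authors' evaluation code: the function names, the `(x1, y1, x2, y2)` box format, and the greedy one-to-one matching rule are all assumptions.

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(preds, refs, conf_thr, iou_thr):
    """Precision and recall for one frame.

    preds: list of (box, score) from the detector under test (e.g. event-based).
    refs:  list of reference boxes (e.g. frame-based detections used as ground truth).
    Matching is greedy and one-to-one, highest-scoring prediction first (an assumption).
    """
    # Discard predictions below the confidence threshold.
    kept = sorted((p for p in preds if p[1] >= conf_thr),
                  key=lambda p: p[1], reverse=True)
    unmatched = list(refs)
    tp = 0
    for box, _score in kept:
        # Best still-unmatched reference box for this prediction, if any remain.
        best = max(unmatched, key=lambda r: iou(box, r), default=None)
        if best is not None and iou(box, best) >= iou_thr:
            unmatched.remove(best)
            tp += 1
    fp = len(kept) - tp
    fn = len(unmatched)
    # Convention chosen here: report 0.0 when a ratio would be undefined.
    precision = tp / (tp + fp) if kept else 0.0
    recall = tp / (tp + fn) if refs else 0.0
    return precision, recall
```

Sweeping `conf_thr` and `iou_thr` over a grid and accumulating the counts across frames would give the kind of threshold-dependent precision and recall the abstract refers to.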

Paper Structure (20 sections, 5 figures, 5 tables)

This paper contains 20 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: (a) Sensor suite featuring a DAVIS 346 frame-event camera and a Cube1 LiDAR for dataset collection. (b) DAVIS camera data sample: events (polarity-coloured in red or blue) overlaid on the RGB frame. (c) LiDAR scan sample with object motion-captured ground truth poses (frames).
  • Figure 2: Frames and geometric transformations in the dataset.
  • Figure 3: Block diagram overview of the proposed vision-LiDAR object detection and tracking.
  • Figure 4: Bounding box point cloud segmentation and filtering examples via vision-based (RGB and event) object detection.
  • Figure 5: Detection samples and estimated dynamics in an urban environment.
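
The bounding-box point cloud segmentation and filtering named in Figure 4 can be illustrated with a small sketch. It assumes a pinhole camera with hypothetical intrinsics `fx, fy, cx, cy`, LiDAR points already transformed into the camera frame (z pointing forward), and a simple median-depth gate for rejecting background points. None of these details are given in the summary, so the paper's actual filtering may well differ.

```python
from statistics import median


def points_in_box(points_cam, box, fx, fy, cx, cy):
    """Keep 3D points (camera frame) whose pinhole projection falls inside a 2D box.

    points_cam: iterable of (X, Y, Z) in metres, Z pointing forward.
    box: (x1, y1, x2, y2) in pixels from a 2D detector.
    """
    x1, y1, x2, y2 = box
    kept = []
    for X, Y, Z in points_cam:
        if Z <= 0:  # point is behind the camera, cannot be projected
            continue
        u = fx * X / Z + cx
        v = fy * Y / Z + cy
        if x1 <= u <= x2 and y1 <= v <= y2:
            kept.append((X, Y, Z))
    return kept


def filter_by_depth(points, tol=0.5):
    """Keep points whose depth lies within tol metres of the median depth.

    A crude way to drop background points; it assumes the object contributes
    most of the points inside the box, which is not guaranteed.
    """
    if not points:
        return []
    z_med = median(p[2] for p in points)
    return [p for p in points if abs(p[2] - z_med) <= tol]


def centroid(points):
    """Mean 3D position of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))
```

A usage example under the same assumptions: `centroid(filter_by_depth(points_in_box(cloud, box, fx, fy, cx, cy)))` would yield one 3D position per detection, which could then serve as the position measurement for a 3D tracker.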