RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments

Haisheng Su, Feixiang Song, Cong Ma, Wei Wu, Junchi Yan

TL;DR

RoboSense introduces a first-of-its-kind egocentric perception dataset and benchmark suite tailored for crowded, unstructured environments, addressing the gap between the needs of autonomous robot navigation and existing driving-focused datasets. It provides a multi-sensor platform with full $360^{\circ}$ coverage, more than $133K$ synchronized frames, $1.4\mathrm{M}$ annotated 3D boxes with IDs, and $216K$ trajectories across $7.6K$ sequences, plus dense occupancy labels to support safe navigation. The paper defines six standardized tasks, spanning 3D detection, multi-object tracking, motion forecasting, and occupancy prediction, together with specialized metrics such as Closest-Collision Distance Proportion (CCDP) that emphasize near-field performance. A benchmark with multiple baselines across LiDAR, vision, and multi-modal fusion shows that the dataset is challenging, highlights the benefits of sensor fusion for near-field perception, and underscores the need for better near-field localization in crowded environments. RoboSense aims to accelerate the development of egocentric perception and navigation systems for social mobile robots operating in real-world crowds. The data are desensitized for privacy protection, and the authors plan to extend the benchmark to further tasks such as motion planning and joint optimization.

Abstract

Reliable embodied perception from an egocentric perspective is challenging yet essential for the autonomous navigation of intelligent mobile agents. With the growing demand for social robotics, near-field scene understanding has become an important research topic among egocentric perception tasks related to navigation in crowded and unstructured environments. Because environmental conditions are complex and surrounding obstacles are often truncated or occluded, perception capability in these circumstances remains limited. To further enhance the intelligence of mobile robots, in this paper we set up an egocentric multi-sensor data collection platform based on 3 main types of sensors (Camera, LiDAR and Fisheye), which supports flexible sensor configurations to enable a dynamic field of view from the ego perspective, capturing either near or more distant areas. On this platform we construct a large-scale multimodal dataset, named RoboSense, to facilitate egocentric robot perception. Specifically, RoboSense contains more than 133K frames of synchronized data with 1.4M 3D bounding boxes and IDs annotated in the full $360^{\circ}$ view, forming 216K trajectories across 7.6K temporal sequences. It has $270\times$ and $18\times$ as many annotations of surrounding obstacles within near ranges as KITTI and nuScenes, respectively, two earlier datasets collected for autonomous driving scenarios. Moreover, we define a novel matching criterion for near-field 3D perception and prediction metrics. Based on RoboSense, we formulate 6 popular tasks to facilitate future research, and provide detailed analysis and benchmarks for each. Data desensitization measures have been applied for privacy protection.

Paper Structure (39 sections, 10 equations, 13 figures, 9 tables)

Figures (13)

  • Figure 1: An example from the RoboSense dataset: data with annotated 3D boxes and occupancy descriptions on Camera, Fisheye, LiDAR, and BEV respectively, where the same targets are associated with unique IDs across different devices and timestamps.
  • Figure 2: Comparison of annotated object distribution among different popular datasets used for perception and prediction tasks.
  • Figure 3: Sensor setup and coordinate system illustration of our data collection platform.
  • Figure 4: Average precision vs. matching function. CD: Center Distance. CDP: Center Distance Proportion. CCDP: Closest-Collision Distance Proportion. IOU: Intersection Over Union. We set the IOU thresholds of Vehicle, Cyclist and Pedestrian to [0.7, 0.5, 0.5] following KITTI (Geiger et al., 2013). CD is set to 2 m following nuScenes (Caesar et al., 2020), and CDP/CCDP = 5% for TP metrics.
  • Figure A1: Comparison of annotated object distribution of different classes between RoboSense and nuScenes datasets.
  • ...and 8 more figures
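
The Figure 4 caption contrasts a fixed center-distance threshold (CD = 2 m) with proportional thresholds (CDP/CCDP = 5%). The formulas are not given in this excerpt, so the following is only a minimal sketch under one assumed reading: CDP is taken to be the center distance divided by the ground-truth object's range from the ego origin. The function names are hypothetical, and CCDP is left out because it would need box geometry that is not described here.

```python
import math


def center_distance(pred_xy, gt_xy):
    """Euclidean distance in meters between predicted and ground-truth centers (BEV)."""
    return math.hypot(pred_xy[0] - gt_xy[0], pred_xy[1] - gt_xy[1])


def is_match_cd(pred_xy, gt_xy, threshold_m=2.0):
    """Fixed-threshold match: center distance within 2 m (threshold from the caption)."""
    return center_distance(pred_xy, gt_xy) <= threshold_m


def is_match_cdp(pred_xy, gt_xy, threshold=0.05):
    """Proportional match. ASSUMPTION: the center distance is normalized by the
    ground-truth range from the ego origin (0, 0); the 5% value is from the caption."""
    gt_range = math.hypot(gt_xy[0], gt_xy[1])
    if gt_range == 0.0:
        # Degenerate case: object at the ego origin; require an exact center match.
        return center_distance(pred_xy, gt_xy) == 0.0
    return center_distance(pred_xy, gt_xy) / gt_range <= threshold


# Near-field object at 4 m with a 0.5 m error: passes CD, fails CDP (12.5% > 5%).
near_cd = is_match_cd((4.5, 0.0), (4.0, 0.0))
near_cdp = is_match_cdp((4.5, 0.0), (4.0, 0.0))

# Far object at 40 m with a 1.5 m error: passes both (3.75% <= 5%).
far_cd = is_match_cd((41.5, 0.0), (40.0, 0.0))
far_cdp = is_match_cdp((41.5, 0.0), (40.0, 0.0))
```

Under this reading, a proportional criterion is stricter close to the robot and more tolerant farther away, which is consistent with the near-field emphasis described in the abstract.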