RoboSense: Large-scale Dataset and Benchmark for Egocentric Robot Perception and Navigation in Crowded and Unstructured Environments
Haisheng Su, Feixiang Song, Cong Ma, Wei Wu, Junchi Yan
TL;DR
RoboSense introduces a first-of-its-kind egocentric perception dataset and benchmark suite tailored for crowded, unstructured environments, addressing a gap between the needs of autonomous robot navigation and existing driving-focused datasets. It is collected with a multi-sensor platform offering full $360^{\circ}$ coverage and contains more than $133\mathrm{K}$ synchronized frames, $1.4\mathrm{M}$ annotated 3D boxes with IDs, and $216\mathrm{K}$ trajectories across $7.6\mathrm{K}$ sequences, plus dense occupancy labels to support safe navigation. The paper defines six standardized tasks, including 3D detection, multi-object tracking, motion forecasting, and occupancy prediction, together with specialized metrics such as Closest-Collision Distance Proportion that emphasize near-field performance. A benchmark with multiple baselines across LiDAR, vision, and multi-modal fusion demonstrates the dataset’s difficulty, highlights the benefits of sensor fusion for near-field perception, and underscores the need for better near-field localization in crowded environments. RoboSense aims to accelerate the development of egocentric perception and navigation systems for social mobile robots operating in real-world crowds. It applies privacy-preserving measures to the data, and the authors plan to extend it to further tasks such as motion planning and joint optimization.
Abstract
Reliable embodied perception from an egocentric perspective is challenging yet essential for the autonomous navigation of intelligent mobile agents. With the growing demand for social robotics, near-field scene understanding has become an important research topic among egocentric perception tasks related to navigation in crowded and unstructured environments. Because environmental conditions are complex and surrounding obstacles are often truncated or occluded, perception capability in these settings remains limited. To further enhance the intelligence of mobile robots, in this paper we set up an egocentric multi-sensor data collection platform based on 3 main types of sensors (Camera, LiDAR and Fisheye), which supports flexible sensor configurations to enable a dynamic field of view from the ego perspective, capturing either near or farther areas. Meanwhile, a large-scale multimodal dataset, named RoboSense, is constructed to facilitate egocentric robot perception. Specifically, RoboSense contains more than 133K synchronized frames with 1.4M 3D bounding boxes and IDs annotated in the full $360^{\circ}$ view, forming 216K trajectories across 7.6K temporal sequences. It has $270\times$ and $18\times$ as many annotations of surrounding obstacles within near ranges as KITTI and nuScenes, respectively, two previous datasets collected for autonomous driving scenarios. Moreover, we define a novel matching criterion for near-field 3D perception and prediction metrics. Based on RoboSense, we formulate 6 popular tasks to facilitate future research, and provide detailed analysis and benchmarks for each. Data desensitization measures have been applied for privacy protection.
