Multi-View Pedestrian Occupancy Prediction with a Novel Synthetic Dataset
Sithu Aung, Min-Cheol Sagong, Junghyun Cho
TL;DR
This work tackles dense pedestrian occupancy prediction in large, multi-view urban scenes by introducing the MVP-Occ synthetic dataset and the OmniOcc baseline. MVP-Occ provides voxel-level semantic labels and panoptic occupancy annotations across five large scenes, supporting training for both 2D ground-plane occupancy and full 3D scene understanding. OmniOcc combines an image encoder, a view-to-voxel projection, a 3D voxel encoder, and prediction heads for 3D semantic/pedestrian occupancy and 2D bird's-eye-view (BEV) occupancy, followed by instance grouping to produce instance and panoptic outputs. In same-scene and synthetic-to-real (WildTrack) evaluations, OmniOcc achieves state-of-the-art performance; ablations show that semantic scene understanding and pedestrian instance grouping improve cross-domain generalization and scene reconstruction.
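To make the instance grouping step concrete, here is a minimal sketch of one way occupied BEV cells could be grouped into pedestrian instances. The summary does not say which grouping algorithm OmniOcc uses, so 4-connected component labeling is swapped in here purely as an illustrative stand-in; the function name `group_instances` and the toy grid are hypothetical.

```python
from collections import deque


def group_instances(occupancy):
    """Label 4-connected groups of occupied cells in a 2D BEV grid.

    Illustrative assumption only: this is NOT confirmed to be the grouping
    method used by OmniOcc.

    occupancy: list of rows, each a list of 0/1 values.
    Returns (labels, count): labels has the same shape as occupancy, with 0
    for free cells and 1..count for instance ids.
    """
    rows = len(occupancy)
    cols = len(occupancy[0]) if rows else 0
    labels = [[0] * cols for _ in range(rows)]
    count = 0
    for r in range(rows):
        for c in range(cols):
            # Skip free cells and cells already assigned to an instance.
            if not occupancy[r][c] or labels[r][c]:
                continue
            count += 1
            labels[r][c] = count
            queue = deque([(r, c)])
            # Breadth-first flood fill over the four axis-aligned neighbours.
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < rows and 0 <= nx < cols
                            and occupancy[ny][nx] and not labels[ny][nx]):
                        labels[ny][nx] = count
                        queue.append((ny, nx))
    return labels, count


# Toy BEV grid with two separate occupied blobs (hypothetical data).
bev = [
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
]
labels, count = group_instances(bev)
```

In this toy grid the two blobs receive distinct ids. Connected components would merge pedestrians standing in adjacent cells, so it is only a rough proxy for the instance grouping the summary mentions, particularly in dense crowds.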
Abstract
We address the challenge of predicting pedestrian occupancy, an extension of multi-view pedestrian detection in urban traffic. To support this, we introduce a new synthetic dataset, MVP-Occ, designed for dense pedestrian scenarios in large-scale scenes. The dataset represents pedestrians in detail using voxel structures and pairs them with rich semantic scene-understanding labels, supporting visual navigation and analysis of pedestrian spatial information. We also present a robust baseline model, OmniOcc, which predicts both the voxel occupancy state and panoptic labels for the entire scene from multi-view images. Through in-depth analysis, we identify and evaluate the key elements of the proposed model and highlight their specific contributions and importance.
