Multi-View Pedestrian Occupancy Prediction with a Novel Synthetic Dataset

Sithu Aung, Min-Cheol Sagong, Junghyun Cho

TL;DR

This work tackles dense pedestrian occupancy prediction in large, multi-view urban scenes by introducing the MVP-Occ synthetic dataset and the OmniOcc baseline. MVP-Occ provides voxel-level semantic labels and panoptic occupancy across five expansive scenes, enabling training for both 2D ground-plane occupancy and full 3D scene understanding. OmniOcc combines an image encoder, a view-to-voxel projection, a 3D voxel encoder, and prediction heads for 3D semantic and pedestrian occupancy alongside a BEV (bird's-eye-view) 2D occupancy head, followed by instance grouping that produces instance and panoptic outputs. In both same-scene and synthetic-to-real (WildTrack) evaluations, OmniOcc is reported to achieve state-of-the-art performance, and ablations indicate that semantic scene understanding and pedestrian instance grouping both contribute to cross-domain generalization and realistic scene reconstruction.
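To make the view-to-voxel projection step more concrete, the following is a minimal sketch, not the authors' implementation. It assumes a pinhole camera with zero skew, nearest-pixel sampling, and simple averaging over the views that see a voxel; none of these choices are stated in this summary. The names `project_point` and `lift_features` are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Matrix = Sequence[Sequence[float]]
Point = Sequence[float]
Camera = Tuple[Matrix, Matrix, Point]  # (K, R, t): intrinsics, rotation, translation


def project_point(K: Matrix, R: Matrix, t: Point, p: Point) -> Optional[Tuple[float, float]]:
    """Project world point p to pixel (u, v); None if p is behind the camera."""
    # World -> camera coordinates: cam = R @ p + t
    cam = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
    if cam[2] <= 1e-6:
        return None
    # Pinhole projection, assuming zero skew (K[0][1] is ignored).
    u = K[0][0] * cam[0] / cam[2] + K[0][2]
    v = K[1][1] * cam[1] / cam[2] + K[1][2]
    return u, v


def lift_features(
    voxel_centers: Sequence[Point],
    cameras: Sequence[Camera],
    feature_maps: Sequence[Sequence[Sequence[Sequence[float]]]],
) -> List[Optional[List[float]]]:
    """Average the per-view features sampled at each voxel center.

    feature_maps[c][row][col] is the channel vector of view c at that pixel.
    A voxel that no view sees gets None (an assumption of this sketch).
    """
    lifted: List[Optional[List[float]]] = []
    for p in voxel_centers:
        total: Optional[List[float]] = None
        hits = 0
        for (K, R, t), fmap in zip(cameras, feature_maps):
            uv = project_point(K, R, t, p)
            if uv is None:
                continue
            # Nearest-pixel sampling (round half up).
            col = math.floor(uv[0] + 0.5)
            row = math.floor(uv[1] + 0.5)
            if 0 <= row < len(fmap) and 0 <= col < len(fmap[0]):
                feat = fmap[row][col]
                total = list(feat) if total is None else [a + b for a, b in zip(total, feat)]
                hits += 1
        lifted.append([s / hits for s in total] if hits else None)
    return lifted
```

A practical implementation would more likely use batched tensor operations and bilinear sampling; the loops here only spell out the geometry.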

Abstract

We address an advanced challenge of predicting pedestrian occupancy as an extension of multi-view pedestrian detection in urban traffic. To support this, we have created a new synthetic dataset called MVP-Occ, designed for dense pedestrian scenarios in large-scale scenes. Our dataset provides detailed representations of pedestrians using voxel structures, accompanied by rich semantic scene understanding labels, facilitating visual navigation and insights into pedestrian spatial information. Furthermore, we present a robust baseline model, termed OmniOcc, capable of predicting both the voxel occupancy state and panoptic labels for the entire scene from multi-view images. Through in-depth analysis, we identify and evaluate the key elements of our proposed model, highlighting their specific contributions and importance.

Paper Structure

This paper contains 62 sections, 22 equations, 8 figures, 15 tables.

Figures (8)

  • Figure 1: Visualizations of the proposed dataset. The primary objective is to predict the semantic and instance labels of the voxels and determine each pedestrian's location within the scene. The dataset includes five expansive scenes with dense pedestrian activity. (Best viewed in color.)
  • Figure 2: Overview of the proposed model. Image features are extracted using a backbone network augmented with an FPN. Next, multi-view 2D features are projected onto the voxel grid along rays and processed with a 3D U-Net to construct a feature volume. Semantic occupancy predictions are generated using a two-layer MLP network, whereas a single convolutional layer predicts the occupancy status of pedestrians. Finally, pedestrian instances are grouped on the basis of both predictions to obtain instance and panoptic occupancy labels. (Best viewed in color.)
  • Figure 3: Qualitative results of 2D and 3D occupancy predictions under same-scene evaluation on the Park scene. (Best viewed in color.)
  • Figure 4: Qualitative results of synthetic-to-real transfer from Facade to WildTrack. (Best viewed in color.)
  • Figure 5: Camera views used for generating a scene point cloud. The number of additional camera views varies depending on the scene's characteristics, such as its size and level of occlusion. However, every scene includes an overhead camera to ensure comprehensive coverage.
  • ...and 3 more figures
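
The Figure 2 caption says pedestrian instances are grouped from both predictions, but this summary does not describe the grouping procedure. The sketch below is one plausible stand-in, not the paper's method. It assumes the pedestrian occupancy head yields ground-plane centers and assigns each pedestrian-labelled voxel to the nearest center within a radius. The class id `PEDESTRIAN`, the `max_dist` radius, and both function names are hypothetical.

```python
from typing import Dict, List, Tuple

Voxel = Tuple[int, int, int]

PEDESTRIAN = 1  # hypothetical semantic class id for pedestrians


def group_instances(
    semantic: Dict[Voxel, int],
    centers: List[Tuple[float, float]],
    max_dist: float = 2.0,
) -> Dict[Voxel, int]:
    """Assign each pedestrian voxel the 1-based id of its nearest center.

    Distance is measured on the ground plane (x, y) only. Voxels farther
    than max_dist from every center get id 0 (unassigned); on an exact
    tie the later center wins.
    """
    instance: Dict[Voxel, int] = {}
    for (x, y, z), cls in semantic.items():
        if cls != PEDESTRIAN:
            continue
        best_id, best_d2 = 0, max_dist ** 2
        for idx, (cx, cy) in enumerate(centers, start=1):
            d2 = (x - cx) ** 2 + (y - cy) ** 2
            if d2 <= best_d2:
                best_id, best_d2 = idx, d2
        instance[(x, y, z)] = best_id
    return instance


def panoptic_labels(
    semantic: Dict[Voxel, int], instance: Dict[Voxel, int]
) -> Dict[Voxel, Tuple[int, int]]:
    """Pair each voxel's semantic class with its instance id (0 for non-pedestrians)."""
    return {v: (cls, instance.get(v, 0)) for v, cls in semantic.items()}


# Usage: two detected centers, three pedestrian voxels near them, one stray
# pedestrian voxel, and one voxel of another (hypothetical) class.
semantic = {(0, 0, 0): 1, (0, 0, 1): 1, (5, 5, 0): 1, (9, 9, 0): 1, (2, 2, 0): 2}
instance = group_instances(semantic, centers=[(0.0, 0.0), (5.0, 5.0)])
panoptic = panoptic_labels(semantic, instance)
```

Grouping on the ground plane keeps all voxels of one standing pedestrian in a single instance regardless of height, which is why the vertical coordinate is ignored in the distance.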