RCR: Robust Crowd Reconstruction with Upright Space from a Single Large-scene Image

Jing Huang, Hao Wen, Tianyi Zhou, Haozhe Lin, Yu-kun Lai, Kun Li

TL;DR

This work tackles monocular crowd reconstruction in large-scene images with unknown camera parameters and arbitrary FoVs. It introduces the Human-scene Virtual Interaction Point (HVIP) to resolve depth ambiguity and a canonical Upright 2D/3D Space with Upright Normalization to decouple camera effects from reconstruction, complemented by Iterative Ground-aware Cropping to handle multiple scales. The proposed Robust Crowd Reconstruction (RCR) achieves globally consistent reconstructions in unified camera space without test-time optimization and is supported by two new datasets, LargeCrowd and SynCrowd. Experimental results demonstrate improved reprojection accuracy and spatial consistency, with the code and data to be released for research use.

Abstract

This paper focuses on spatially consistent reconstruction of the poses and shapes of hundreds of people from a single large-scene image with varying human scales under arbitrary camera fields of view (FoVs). Due to the small and highly varying 2D human scales, depth ambiguity, and perspective distortion, no existing method achieves globally consistent reconstruction with correct reprojection. To address these challenges, we first propose a new concept, the Human-scene Virtual Interaction Point (HVIP), which converts complex 3D human localization into 2D pixel localization. We then extend it to RCR (Robust Crowd Reconstruction), which achieves globally consistent reconstruction and stable generalization across camera FoVs without test-time optimization. To perceive humans at varying pixel sizes, we propose Iterative Ground-aware Cropping, which automatically crops the image and then merges the per-crop results. To eliminate the influence of the camera and the cropping process on the reconstruction, we introduce a canonical Upright 3D Space and the corresponding Upright 2D Space. To link the canonical space and the camera space, we propose Upright Normalization, which transforms the local crop input into the Upright 2D Space and transforms the output from the Upright 3D Space into the unified camera space. In addition, we contribute two benchmark datasets, LargeCrowd and SynCrowd, for evaluating crowd reconstruction in large scenes. Experimental results demonstrate the effectiveness of the proposed method. The source code and data will be publicly available for research purposes.
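The abstract does not spell out how a 2D HVIP pixel is turned back into a 3D position. One standard way to read "converting 3D localization into 2D pixel localization" is a ray-plane intersection: if a pixel is known to lie on the ground, its 3D location follows from the camera ray through that pixel. The sketch below is illustrative only and is not the authors' implementation. It assumes a pinhole camera with known intrinsics `K` and a known ground plane `n·X + d = 0` in camera space, whereas the paper targets unknown camera parameters, and how RCR obtains such quantities is not described in this section. The function name `backproject_to_ground` and all numeric values are hypothetical.

```python
import numpy as np


def backproject_to_ground(pixel, K, plane_normal, plane_offset):
    """Intersect the camera ray through `pixel` with the plane n.X + d = 0.

    All inputs are assumptions for illustration: K is a 3x3 pinhole
    intrinsic matrix, and (plane_normal, plane_offset) describe the
    ground plane in camera coordinates.
    """
    u, v = pixel
    # Direction of the ray through the pixel, in camera space (z = 1).
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])
    denom = plane_normal @ ray
    if abs(denom) < 1e-9:
        raise ValueError("ray is parallel to the ground plane")
    # Solve n.(t * ray) + d = 0 for the ray parameter t.
    t = -plane_offset / denom
    if t <= 0:
        raise ValueError("intersection is behind the camera")
    return t * ray


# Hypothetical camera: focal length 1000 px, principal point (960, 540).
K = np.array([[1000.0, 0.0, 960.0],
              [0.0, 1000.0, 540.0],
              [0.0, 0.0, 1.0]])

# Hypothetical ground plane 1.5 m below the camera (camera y axis points down),
# so ground points satisfy y = 1.5, i.e. n = (0, 1, 0), d = -1.5.
n = np.array([0.0, 1.0, 0.0])
d = -1.5

# A ground-contact pixel 300 px below the principal point.
point = backproject_to_ground((960.0, 840.0), K, n, d)
print(point)
```

Under these assumptions the depth ambiguity disappears because the ground plane supplies the one constraint that a single pixel lacks: each ground pixel maps to exactly one 3D point.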

Paper Structure

This paper contains 34 sections, 6 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: Given a single image under arbitrary FoVs, our method can reconstruct human poses and shapes in a unified camera space with both spatial consistency and reprojection accuracy.
  • Figure 2: Our method achieves accurate reprojection (top row), correct relative positions, and plausible human-ground interaction (bottom row), while the state-of-the-art method GroupRec and our conference version Crowd3D do not perform well in these aspects.
  • Figure 3: Overview of our method. Our method first detects all the humans and then reconstructs each person via the upright space (details in Fig. 4). Thanks to the canonical Upright Space, although SMPL estimation is conducted separately in the Upright 3D Space, the results achieve both spatial consistency and reprojection accuracy in the camera space.
  • Figure 4: Reconstructing each person via Upright 2D/3D Space, Upright Normalization, and HVIPNet.
  • Figure 5: Qualitative comparison on the LargeCrowd dataset. The color of a reconstructed human corresponds to the matched ground truth, while unmatched individuals are shown in gray. In the zoomed-in images A, B, and C, color saturation indicates whether a reconstructed result lies fully within the cropped sub-image: lower saturation means the result is not fully within that area. Shadows are omitted when rendering some of the compared methods because their results exhibit significant offsets from a unified ground plane.
  • ...and 3 more figures