CrowdGaussian: Reconstructing High-Fidelity 3D Gaussians for Human Crowd from a Single Image

Yizheng Song, Yiyu Zhuang, Qipeng Xu, Haixiang Wang, Jiahe Zhu, Jing Tian, Siyu Zhu, Hao Zhu

Abstract

Single-view 3D human reconstruction has garnered significant attention in recent years. Despite numerous advances, prior research has concentrated on reconstructing 3D models from clear, close-up images of individual subjects, and often yields subpar results in the more prevalent multi-person scenarios. Reconstructing 3D human crowd models is a highly intricate task, with challenges such as: 1) extensive occlusions, 2) low image clarity, and 3) numerous and varied appearances. To address this task, we propose CrowdGaussian, a unified framework that directly reconstructs multi-person 3D Gaussian Splatting (3DGS) representations from single-image inputs. To handle occlusions, we devise a self-supervised adaptation pipeline that enables a pretrained large human model to reconstruct complete 3D humans with plausible geometry and appearance from heavily occluded inputs. Furthermore, we introduce Self-Calibrated Learning (SCL), a training strategy that enables single-step diffusion models to adaptively refine coarse renderings to optimal quality by blending identity-preserving samples with clean/corrupted image pairs. The refined outputs can then be distilled back to enhance the quality of the multi-person 3DGS representations. Extensive experiments demonstrate that CrowdGaussian generates photorealistic, geometrically coherent reconstructions of multi-person scenes.

Paper Structure (30 sections, 16 equations, 13 figures, 3 tables)

Figures (13)

  • Figure 1: From a single in-the-wild crowd image (top-left), CrowdGaussian reconstructs a high-fidelity multi-person 3D Gaussian scene (blue). Crucially, our method ensures occlusion-robust completion, producing plausible geometry for invisible regions as verified by consistent novel-view rendering (green). Furthermore, it demonstrates exceptional degradation robustness, successfully recovering sharp details even from low-resolution inputs (yellow).
  • Figure 2: Overview of the proposed CrowdGaussian framework. Our pipeline operates in two stages. In Stage 1, we first estimate SMPL-X parameters and segment individuals from the input image. These occluded crops are processed by our LORM to hallucinate complete geometries, assembling an initial coarse multi-person 3DGS scene. In Stage 2, we render this coarse scene into RGB images and normal maps. Our CrowdRefiner leverages these cues to generate high-fidelity pseudo-ground truths, which are then distilled back into the 3D Gaussians via differentiable rendering, significantly enhancing local details and overall sharpness.
  • Figure 3: The Large Occluded Human Reconstruction Model (LORM). (a) Architecture. LORM takes an occluded image and a template to reconstruct a complete 3D human. To preserve priors while enabling efficient adaptation, we freeze pre-trained backbones and inject trainable LoRA exclusively into the transformer. (b) Self-Supervised Training. We employ a Teacher-Student framework where the teacher generates clean pseudo-GTs from complete images. These signals guide the student to hallucinate complete geometries from occluded inputs via self-distillation, achieving robustness without external 3D supervision.
  • Figure 4: Architecture of CrowdRefiner, our single-step diffusion refiner for enhancing crowd 3D Gaussians. Given a coarse rendering and its corresponding SMPL normal map as a geometric prior, CrowdRefiner generates a high-fidelity refined output. The model is fine-tuned from SD-Turbo with a LoRA-finetuned VAE decoder and a trainable PoseNet, while the VAE encoder remains frozen. Zoom-in regions highlight significant improvements in local details such as hair and clothing textures.
  • Figure 5: Effect of Self-Calibrated Learning (SCL). Without SCL (middle), the model tends to over-refine, causing facial distortions and artifacts. With SCL (right), structural integrity is preserved while details are enhanced, demonstrating adaptive refinement enabled by mixed identity supervision during training.
  • ...and 8 more figures
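
The mixed supervision described in the abstract and in the Figure 5 caption (identity-preserving samples blended with clean/corrupted pairs) can be illustrated with a toy data-mixing sketch. This is only one plausible reading of that description, not the authors' implementation: the names `corrupt`, `build_mixed_pairs`, and `identity_ratio`, the uniform-noise degradation, and the flat-list image representation are all assumptions introduced here for illustration.

```python
import random
from typing import List, Sequence, Tuple

Image = List[float]  # toy stand-in for an image: a flat list of pixel values in [0, 1]


def corrupt(image: Image, rng: random.Random, noise: float = 0.1) -> Image:
    """Hypothetical degradation: add uniform noise, then clamp to [0, 1]."""
    return [min(1.0, max(0.0, p + rng.uniform(-noise, noise))) for p in image]


def build_mixed_pairs(
    clean_images: Sequence[Image],
    identity_ratio: float,
    rng: random.Random,
) -> List[Tuple[Image, Image, bool]]:
    """Return (input, target, is_identity) training pairs.

    With probability `identity_ratio` (an assumed hyperparameter) the input is
    the clean image itself, so a refiner trained on it is supervised to leave
    already-good inputs unchanged. Otherwise the input is a corrupted copy and
    the target is the clean image.
    """
    if not 0.0 <= identity_ratio <= 1.0:
        raise ValueError("identity_ratio must lie in [0, 1]")
    pairs: List[Tuple[Image, Image, bool]] = []
    for clean in clean_images:
        if rng.random() < identity_ratio:
            # Identity-preserving sample: input and target are the same image.
            pairs.append((list(clean), list(clean), True))
        else:
            # Clean/corrupted pair: the refiner must restore the clean target.
            pairs.append((corrupt(clean, rng), list(clean), False))
    return pairs


if __name__ == "__main__":
    rng = random.Random(0)
    images = [[0.2, 0.5, 0.8] for _ in range(8)]
    for inp, tgt, is_identity in build_mixed_pairs(images, identity_ratio=0.25, rng=rng):
        print("identity" if is_identity else "corrupted", inp, "->", tgt)
```

Under this reading, the identity samples are what would discourage the over-refinement shown in the middle panel of Figure 5: a refiner that only ever sees corrupted inputs has no signal telling it when to stop changing the image.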