HumanCrafter: Synergizing Generalizable Human Reconstruction and Semantic 3D Segmentation

Panwang Pan, Tingting Shen, Chenxin Li, Yunlong Lin, Kairun Wen, Jingjing Zhao, Yixuan Yuan

TL;DR

HumanCrafter tackles simultaneous 3D human reconstruction and body-part segmentation from a single image. It introduces a feed-forward pipeline that converts aggregated multi-view features into pixel-aligned 3D Gaussian primitives; a second transformer then produces semantic 3D Gaussians, which are rendered with a differentiable renderer. The model leverages human priors (e.g., SMPL, Plücker embeddings) and diffusion-based appearance priors, and it enables cross-task learning through a joint render-distillation-segmentation objective. It achieves state-of-the-art results in both 3D segmentation and single-image 3D reconstruction while running in real time. These capabilities enable practical applications in AR/VR, editing, and immersive exploration; the work also discusses ethical considerations and future directions.
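
The Plücker embeddings mentioned above encode each camera ray as a 6-vector. The following is a minimal sketch of that general construction, not the paper's implementation: the function names `cross` and `plucker_ray` are hypothetical, and the ordering (unit direction first, then the moment `origin × direction`) is one common convention assumed here, since this summary does not state which one HumanCrafter uses.

```python
import math


def cross(a, b):
    """Cross product of two 3-vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def plucker_ray(origin, direction):
    """Return the 6-D Plücker coordinates (d, o x d) of a ray.

    origin    -- any point on the ray (e.g. the camera center), 3-tuple
    direction -- ray direction, 3-tuple (need not be normalized)

    The (d, m) ordering is an assumed convention for illustration.
    """
    norm = math.sqrt(sum(c * c for c in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    d = tuple(c / norm for c in direction)  # unit direction
    m = cross(origin, d)                    # moment of the ray about the world origin
    return d + m


# A ray through (0, 0, 1) pointing along +x.
ray = plucker_ray((0.0, 0.0, 1.0), (2.0, 0.0, 0.0))
```

The moment term does not depend on which point of the ray is used as `origin`, so the 6-vector identifies the ray itself rather than a particular camera position along it. That property is what makes it a convenient per-pixel camera encoding.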

Abstract

Recent advances in generative models have achieved high fidelity in 3D human reconstruction, yet their utility for specific tasks (e.g., human 3D segmentation) remains constrained. We propose HumanCrafter, a unified framework that enables the joint modeling of appearance and human-part semantics from a single image in a feed-forward manner. Specifically, we integrate human geometric priors in the reconstruction stage and self-supervised semantic priors in the segmentation stage. To address the scarcity of labeled 3D human datasets, we further develop an interactive annotation procedure for generating high-quality data-label pairs. Our pixel-aligned aggregation enables cross-task synergy, while the multi-task objective simultaneously optimizes texture modeling fidelity and semantic consistency. Extensive experiments demonstrate that HumanCrafter surpasses existing state-of-the-art methods in both 3D human-part segmentation and 3D human reconstruction from a single image.
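
As a reading aid, a multi-task objective of this kind is often written as a weighted sum of per-task terms. The expression below is only an illustrative sketch built from the render, distillation, and segmentation terms named in the TL;DR. The weights $\lambda_{\text{distill}}$ and $\lambda_{\text{seg}}$ and the additive form are assumptions, not the paper's stated equation.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{render}}
  + \lambda_{\text{distill}} \, \mathcal{L}_{\text{distill}}
  + \lambda_{\text{seg}} \, \mathcal{L}_{\text{seg}}
```

Under this reading, $\mathcal{L}_{\text{render}}$ would drive texture fidelity, while the other two terms would drive semantic consistency.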

Paper Structure

This paper contains 25 sections, 4 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: We introduce HumanCrafter, a unified framework for simultaneous human 3D reconstruction and body-part segmentation from single images. HumanCrafter introduces explicit 3D Gaussian VersatileSplats, showcasing enhanced performance over foundation models in delivering 3D-consistent segmentation outcomes. This breakthrough offers significant advantages for downstream applications.
  • Figure 2: The network architecture of HumanCrafter. The proposed method fully utilizes 2D diffusion priors and human body geometry features to regress pixel-aligned point maps via a generic Transformer (Sec. \ref{feat_agg}). Subsequently, another Transformer (Sec. \ref{mechanism}) employs an attention mechanism to produce a set of semantic 3D Gaussians that encapsulate geometric, appearance, and semantic information. The entire pipeline is trained in an end-to-end manner by minimizing a loss function (Sec. \ref{sec:Objective}) that compares the predicted outputs against ground truth data and rasterized label maps from novel viewpoints.
  • Figure 3: Qualitative Results and Comparisons on Human 3D Segmentation on THuman2.1 and 2K2K datasets. HumanCrafter achieves the most precise segmentation results in terms of 3D consistency.
  • Figure 4: Novel-view images rendered by HumanCrafter and the state-of-the-art baselines on various datasets. Our method achieves the highest rendering quality. Please refer to the zoomed-in regions for details.
  • Figure 5: Ablation of Pixel-Align Aggregation. HumanCrafter with PA$^2$ can leverage knowledge learned from the novel-view synthesis task and incorporate a pre-trained 2D model, thereby boosting semantic tasks.
  • ...and 3 more figures