MultiGO: Towards Multi-level Geometry Learning for Monocular 3D Textured Human Reconstruction

Gangjian Zhang, Nanjie Yao, Shunsi Zhang, Hanfeng Zhao, Guoliang Pang, Jian Shu, Hao Wang

TL;DR

MultiGO tackles monocular 3D textured human reconstruction by introducing a multi-level geometry learning framework that leverages SMPL-X priors and a Gaussian-based representation. It contributes three modules—Skeleton-Level Enhancement, Joint-Level Augmentation, and Wrinkle-Level Refinement—to jointly improve pose accuracy and fine-grained details such as wrinkles, while coupling 3D Fourier features with 2D image information for robust skeleton modeling. A diffusion-inspired refinement process further enhances mesh wrinkles, yielding high-quality geometry and texture as demonstrated on CustomHuman and THuman3.0, where MultiGO achieves state-of-the-art results. The approach offers a practical, efficient path to realistic 3D human reconstructions from single-view imagery, with strong generalization to out-of-distribution data and detailed texture preservation tied to geometric fidelity.

Abstract

This paper investigates the task of reconstructing a 3D clothed human body from a monocular image. Due to the inherent ambiguity of single-view input, existing approaches leverage pre-trained SMPL(-X) estimation models or generative models to provide auxiliary information for human reconstruction. However, these methods capture only the general human body geometry and overlook specific geometric details, leading to inaccurate skeleton reconstruction, incorrect joint positions, and unclear cloth wrinkles. In response to these issues, we propose a multi-level geometry learning framework. Technically, we design three key components: skeleton-level enhancement, joint-level augmentation, and wrinkle-level refinement modules. Specifically, we integrate projected 3D Fourier features into a Gaussian reconstruction model, introduce perturbations during training to improve joint depth estimation, and refine coarse human wrinkles by mimicking the de-noising process of a diffusion model. Extensive quantitative and qualitative experiments on two out-of-distribution test sets show the superior performance of our approach compared to state-of-the-art (SOTA) methods.
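As a rough illustration of the skeleton-level idea, the sketch below encodes 3D points with sin/cos Fourier features and orthographically projects them onto a 2D grid. The abstract does not give the exact encoding or projection, so everything here is an assumption made for illustration: the function names `fourier_features` and `splat_to_image`, the frequency schedule (powers of two times pi), the point range of [-1, 1]^3, and the nearest-pixel averaging used in place of the interpolation the paper describes.

```python
import numpy as np

def fourier_features(points, num_bands=4):
    """Encode (N, 3) points as (N, 3 * 2 * num_bands) sin/cos features.

    The frequency schedule 2^k * pi is an assumed, commonly used choice.
    """
    points = np.asarray(points, dtype=np.float64)
    freqs = (2.0 ** np.arange(num_bands)) * np.pi        # (B,)
    angles = points[:, :, None] * freqs                   # (N, 3, B)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(points.shape[0], -1)             # (N, 6B)

def splat_to_image(points, feats, size=8, drop_axis=2):
    """Project points in [-1, 1]^3 along `drop_axis` onto a size x size grid.

    Features that land in the same pixel are averaged (a simplification).
    """
    points = np.asarray(points, dtype=np.float64)
    keep = [a for a in range(3) if a != drop_axis]
    uv = points[:, keep]                                   # (N, 2)
    pix = np.clip(((uv + 1.0) * 0.5 * size).astype(int), 0, size - 1)
    grid = np.zeros((size, size, feats.shape[1]))
    count = np.zeros((size, size, 1))
    # np.add.at accumulates correctly even when pixels repeat.
    np.add.at(grid, (pix[:, 1], pix[:, 0]), feats)
    np.add.at(count, (pix[:, 1], pix[:, 0]), 1.0)
    return grid / np.maximum(count, 1.0)

if __name__ == "__main__":
    pts = np.array([[0.0, 0.0, 0.0], [0.5, -0.5, 0.25]])
    f = fourier_features(pts)
    front = splat_to_image(pts, f, drop_axis=2)  # one of several possible views
    print(f.shape, front.shape)
```

Calling `splat_to_image` with a different `drop_axis` gives a view from another direction, which loosely mirrors mapping the features from several angles into the image's 2D space.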

Paper Structure

This paper contains 19 sections, 6 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Comparisons with SOTA Methods in Monocular 3D Textured Human Reconstruction. Existing SOTA methods struggle to recover correct human poses and intricate geometric details. SiFU [Zhang_2024_sifu] fails to reconstruct correct human poses, for example misplacing the left hand. VS [VS_CVPR2024] performs poorly in fine-grained areas, producing unclear fingers and cloth wrinkles. SiTH [ho2024sith] produces geometry and texture errors that originate from its generative model, such as a third arm on the back.
  • Figure 2: Method Overview. Our method, MultiGO, addresses monocular textured 3D human reconstruction by introducing a multi-level geometry learning framework that significantly enhances reconstruction quality. To accurately capture the human body's posture, we propose the SLE module, which projects 3D Fourier features into the 2D space of the input image, allowing the Gaussian reconstruction model to fully utilize prior human shape knowledge. For improved depth estimation of human joints, the JLA strategy applies controlled perturbations during training, increasing the model's robustness to depth inaccuracies during inference. To refine geometric details such as body wrinkles, the WLR module mimics the final de-noising steps of a diffusion model, treating coarse meshes as Gaussian noise and using the high-quality texture of the reconstructed Gaussians as the condition for refining wrinkles.
  • Figure 3: Skeleton-Level Enhancement Module. To enhance the human geometry at the skeleton level, we achieve better fusion of the heterogeneous modalities of the 3D SMPL-X body prior and 2D images. We propose interpolating the Fourier features of 3D occluded points and mapping them from three different angles into the same 2D space as the image features.
  • Figure 4: Joint-Level Augmentation Strategy. To enhance human geometry at the joint level, we augment the input SMPL-X body mesh samples during training. We randomly perturb the ground-truth SMPL-X parameters associated with specific joints to increase the model's robustness at inference.
  • Figure 5: Wrinkle-Level Refinement Module. To improve human geometry at the wrinkle level, we equate the refinement process with the last few steps of the de-noising process in a diffusion model and use a fixed number of "de-noising" steps to predict the refined mesh from an initialized coarse mesh.
  • ...and 2 more figures
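The joint-level augmentation described in the Figure 4 caption can be sketched as adding random noise to the pose parameters of selected joints only. This is a minimal sketch under stated assumptions, not the authors' implementation: the pose is assumed to be a (J, 3) array of per-joint axis-angle rotations, and the function name `perturb_joint_pose`, the Gaussian noise model, the scale `sigma`, and the joint indices are all hypothetical.

```python
import numpy as np

def perturb_joint_pose(pose, joint_ids, sigma=0.05, rng=None):
    """Return a copy of `pose` with Gaussian noise added to selected joints.

    pose:      (J, 3) array of per-joint rotation parameters (assumed layout).
    joint_ids: indices of the joints to perturb (hypothetical selection).
    sigma:     noise standard deviation (assumed value, in radians).
    """
    rng = np.random.default_rng() if rng is None else rng
    pose = np.asarray(pose, dtype=np.float64)
    joint_ids = np.asarray(joint_ids, dtype=int)
    noisy = pose.copy()  # leave the ground-truth parameters untouched
    noise = rng.normal(0.0, sigma, size=(joint_ids.size, pose.shape[1]))
    noisy[joint_ids] += noise
    return noisy

if __name__ == "__main__":
    gt_pose = np.zeros((21, 3))   # placeholder ground-truth pose
    train_pose = perturb_joint_pose(gt_pose, joint_ids=[3, 7])
    print(np.abs(train_pose).sum(axis=1))
```

Training on the perturbed pose while supervising with the unperturbed target is one way such noise could make a model tolerant of imperfect SMPL-X estimates at inference; the source does not specify the training details.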