MultiGO++: Monocular 3D Clothed Human Reconstruction via Geometry-Texture Collaboration

Nanjie Yao, Gangjian Zhang, Wenhao Shen, Jian Shu, Yu Feng, Hao Wang

TL;DR

This work proposes MultiGO++, a monocular 3D clothed human reconstruction framework built on systematic geometry-texture collaboration; experiments on two benchmarks and in-the-wild cases show that it outperforms state-of-the-art approaches.

Abstract

Monocular 3D clothed human reconstruction aims to generate a complete and realistic textured 3D avatar from a single image. Existing methods are commonly trained under multi-view supervision with annotated geometric priors; during inference, these priors are estimated from the monocular input by a pre-trained network. These methods are constrained by three key limitations: texturally by the unavailability of training data, geometrically by inaccurate external priors, and systematically by biased single-modality supervision, all of which lead to suboptimal reconstruction. To address these issues, we propose a novel reconstruction framework, named MultiGO++, which achieves effective systematic geometry-texture collaboration. It consists of three core parts: (1) a multi-source texture synthesis strategy that constructs 15,000+ 3D textured human scans to improve texture estimation quality in challenging scenarios; (2) a region-aware shape extraction module that extracts features of each body region and lets them interact to obtain geometry information, together with a Fourier geometry encoder that mitigates the modality gap to achieve effective geometry learning; (3) a dual reconstruction U-Net that leverages geometry-texture collaborative features to refine and generate high-fidelity textured 3D human meshes. Extensive experiments on two benchmarks and many in-the-wild cases show the superiority of our method over state-of-the-art approaches.

Paper Structure (12 sections, 4 equations, 10 figures, 5 tables)

Figures (10)

  • Figure 1: Monocular 3D human reconstruction on challenging in-the-wild cases. The proposed MultiGO++ generalizes well and remains robust even on difficult in-the-wild cases such as those shown in the figure.
  • Figure 2: Method Overview. Our framework integrates three core components: Texturally, we employ a multi-source texture synthesis strategy to generate diverse synthetic data for training, along with a lightweight texture encoder for effective feature extraction. Geometrically, we introduce a Region-aware Shape Extraction Module that enhances human shape extraction through part-based feature interaction, utilizing Self-Attention (SA), Cross-Attention (CA), and Feed-Forward Networks (FFN). This is coupled with a Fourier Geometry Encoder to bridge the modality gap for efficient geometric learning. Systematically, we propose a Dual Reconstruction U-Net that utilizes feature residuals to balance geometric and texture features, enabling mutual enhancement across modalities. Additionally, to refine 3D mesh quality and extraction efficiency, we design a Gaussian-enhanced remeshing strategy supervised by the generated normal Gaussian avatar.
  • Figure 3: Multi-source Texture Synthesis Strategy. The proposed multi-source texture synthesis strategy leverages X-to-3D models and multimodal LLM data screening to generate high-quality training data for enhanced texture estimation.
  • Figure 4: Detailed Architecture of Fourier Geometry Encoder. For effective geometry learning, the encoder improves the fusion of two heterogeneous modalities, the 3D geometry prior and 2D images. We propose interpolating the Fourier features of 3D occluded points and mapping them from three different angles into the same 2D space as the image features.
  • Figure 5: Qualitative comparisons on in-the-wild images featuring loose clothing. While other SOTA methods struggle to accurately reconstruct the challenging geometries of loose garments, our approach faithfully reproduces high-fidelity wrinkles and intricate textures. Please zoom in for a detailed view.
  • ...and 5 more figures
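
The Figure 4 caption describes lifting 3D points to Fourier features and mapping them from three angles into a 2D space shared with the image features. The sketch below illustrates that general idea only and is not the authors' encoder. It assumes a standard sin/cos Fourier feature mapping, orthographic projection onto the three axis-aligned planes, point coordinates in [-1, 1], and per-pixel averaging in place of the interpolation mentioned in the caption. All function names and parameters (`fourier_features`, `splat_to_plane`, `encode_three_views`, `num_bands`, `res`) are hypothetical.

```python
import math


def fourier_features(point, num_bands=4):
    """Map each coordinate c to sin/cos pairs at frequencies 2^k * pi (assumed form)."""
    feats = []
    for c in point:
        for k in range(num_bands):
            freq = (2.0 ** k) * math.pi
            feats.append(math.sin(freq * c))
            feats.append(math.cos(freq * c))
    return feats


def splat_to_plane(points, feats, drop_axis, res=8):
    """Project points orthographically onto the plane that drops `drop_axis`
    and average the features of all points landing in the same pixel."""
    dim = len(feats[0])
    grid = [[[0.0] * dim for _ in range(res)] for _ in range(res)]
    count = [[0] * res for _ in range(res)]
    keep = [a for a in range(3) if a != drop_axis]
    for p, f in zip(points, feats):
        # Coordinates are assumed to lie in [-1, 1]; clamp to valid pixel indices.
        u = min(res - 1, max(0, int((p[keep[0]] + 1.0) / 2.0 * res)))
        v = min(res - 1, max(0, int((p[keep[1]] + 1.0) / 2.0 * res)))
        count[v][u] += 1
        for i in range(dim):
            grid[v][u][i] += f[i]
    for v in range(res):
        for u in range(res):
            if count[v][u]:
                grid[v][u] = [x / count[v][u] for x in grid[v][u]]
    return grid


def encode_three_views(points, num_bands=4, res=8):
    """Return three res x res feature maps, one per dropped axis (0, 1, 2)."""
    feats = [fourier_features(p, num_bands) for p in points]
    return [splat_to_plane(points, feats, axis, res) for axis in range(3)]
```

Each returned map has shape `res x res x (3 * 2 * num_bands)`, so it can be stacked or concatenated with image feature maps of the same spatial size. Pixels that receive no points stay at zero in this sketch; a learned encoder would more plausibly interpolate or fill them.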