
SIFU: Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction

Zechuan Zhang, Zongxin Yang, Yi Yang

TL;DR

SIFU tackles single-image clothed human reconstruction by introducing a Side-view Conditioned Implicit Function that uses SMPL-X normals as cross-attention queries to decouple side-view features during 2D-to-3D mapping, yielding robust geometry under challenging poses. Complementing this, a 3D Consistent Texture Refinement pipeline leverages diffusion priors and cross-view consistency to generate realistic textures for unseen views, while preserving texture coherence across perspectives. The approach achieves state-of-the-art geometry and texture quality on THuman2.0 and CAPE, demonstrates strong robustness to SMPL-X estimation errors, and supports real-world applications such as 3D printing and scene construction. By integrating explicit human priors with diffusion-based texture priors and a hybrid feature fusion strategy, SIFU delivers practical, high-fidelity clothed-human reconstructions from monocular images with broad real-world impact.
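The cross-attention idea in the summary above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the token counts, the feature width `d_model`, the random projection matrices, and the names `image_tokens` and `normal_tokens` are all assumptions made for illustration. It only shows the general pattern in which tokens derived from a SMPL-X side-view normal map act as queries, and tokens from the input image act as keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: each query token attends over all context tokens."""
    q = queries @ w_q                         # (num_queries, d)
    k = context @ w_k                         # (num_context, d)
    v = context @ w_v                         # (num_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])   # scaled dot-product scores
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d_model = 16  # hypothetical feature width

# Hypothetical inputs: encoded patches of the input image, and an embedded
# SMPL-X normal map rendered from one side view.
image_tokens = rng.normal(size=(64, d_model))
normal_tokens = rng.normal(size=(64, d_model))

# Random projections stand in for learned weights.
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) / np.sqrt(d_model)
                 for _ in range(3))

# Queries come from the side-view normals; keys and values come from the image.
side_features, attn = cross_attention(normal_tokens, image_tokens, w_q, w_k, w_v)
print(side_features.shape, attn.shape)
```

In this pattern the output has one feature vector per query token, so the body-prior normals determine where in the side view each feature lands, while the image supplies the content.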

Abstract

Creating high-quality 3D models of clothed humans from single images for real-world applications is crucial. Despite recent advancements, accurately reconstructing humans in complex poses or with loose clothing from in-the-wild images, along with predicting textures for unseen areas, remains a significant challenge. A key limitation of previous methods is their insufficient prior guidance in transitioning from 2D to 3D and in texture prediction. In response, we introduce SIFU (Side-view Conditioned Implicit Function for Real-world Usable Clothed Human Reconstruction), a novel approach combining a Side-view Decoupling Transformer with a 3D Consistent Texture Refinement pipeline. SIFU employs a cross-attention mechanism within the transformer, using SMPL-X normals as queries to effectively decouple side-view features in the process of mapping 2D features to 3D. This method improves not only the precision of the 3D models but also their robustness, especially when SMPL-X estimates are imperfect. Our texture refinement process leverages a text-to-image diffusion-based prior to generate realistic and consistent textures for invisible views. Through extensive experiments, SIFU surpasses SOTA methods in both geometry and texture reconstruction, showcasing enhanced robustness in complex scenarios and achieving unprecedented Chamfer and P2S measurements. Our approach extends to practical applications such as 3D printing and scene building, demonstrating its broad utility in real-world scenarios. Project page: https://river-zhang.github.io/SIFU-projectpage/
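For readers unfamiliar with the Chamfer and P2S (point-to-surface) metrics named in the abstract, the following is a rough point-sampled sketch. It is an approximation under stated assumptions, not the evaluation code used in the paper: both surfaces are represented by sampled points rather than triangles, the functions `one_way_distance` and `chamfer_distance` are hypothetical names, and the direction and averaging conventions differ between papers.

```python
import numpy as np

def one_way_distance(source, target):
    """Mean distance from each source point to its nearest target point.

    With target points sampled densely from a surface, this approximates a
    one-directional point-to-surface (P2S-style) distance.
    """
    diff = source[:, None, :] - target[None, :, :]   # (Ns, Nt, 3)
    dists = np.linalg.norm(diff, axis=-1)            # pairwise distances
    return dists.min(axis=1).mean()

def chamfer_distance(a, b):
    """Symmetric Chamfer-style distance: average of both one-way distances."""
    return 0.5 * (one_way_distance(a, b) + one_way_distance(b, a))

# Toy point sets standing in for samples from a reconstruction and a scan.
pred = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [3.0, 0.0, 0.0]])

print(one_way_distance(pred, gt))   # every predicted point coincides with a scan point
print(one_way_distance(gt, pred))   # the scan point at x=3 is 2 units from the prediction
print(chamfer_distance(pred, gt))
```

The toy data shows why a one-directional metric can hide errors: the prediction lies entirely on the scan, yet it misses part of it, which only the reverse direction and the symmetric distance reveal.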

Paper Structure

This paper contains 18 sections, 13 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: With just a single image, SIFU is capable of reconstructing a high-quality 3D clothed human model, making it well-suited for practical applications such as 3D printing and scene creation. At the heart of SIFU is a novel Side-view Conditioned Implicit Function, which is key to enhancing feature extraction and geometric precision. Furthermore, SIFU introduces a 3D Consistent Texture Refinement process, greatly improving texture quality and facilitating texture editing with the help of text-to-image diffusion models. Notably proficient in dealing with complex poses and loose clothing, SIFU stands out as an ideal solution for real-world applications.
  • Figure 2: Contrast between previous methods (Left) and ours (Right): Our approach improves the reconstruction process by incorporating additional guidance on geometry and texture priors.
  • Figure 3: Comparison of SIFU with State-of-the-Art (SOTA) Methods in 3D Human Inference from In-the-Wild Images. Existing SOTA methods often struggle with complex poses and loose clothing, leading to a range of artifacts. These issues include the absence of human shapes (PIFu, PaMIR, PIFuHD), missing body parts (ECON), disrupted clothing (ICON, D-IF), and a lack of fine details (GTA). In contrast, SIFU effectively addresses these challenges, delivering high-quality, detailed results.
  • Figure 4: Given a single image, SIFU constructs a 3D clothed human mesh with coarse textures using a Side-view Conditioned Implicit Function (§\ref{sec:side-view implicit}). This is followed by a 3D Consistent Texture Refinement step (§\ref{sec:texture refinement}) to generate detailed textures. Specifically, SIFU employs a side-view decoupling transformer to decouple features from the input image and the side-view normals of the SMPL-X model. These features are then combined at a query point through a hybrid prior fusion strategy, aiding in the reconstruction of both the mesh and its texture. Finally, the mesh with its coarse textures undergoes a diffusion-based 3D consistent texture refinement, which enforces feature consistency in the latent space and results in high-quality textures.
  • Figure 5: Texture comparison against SOTA methods (§\ref{sec:evaluation}). We quantitatively and qualitatively compare texture quality on THuman2.0 [THuman2.0:2021]. PIXIE [PIXIE:2021] is used for SMPL-X estimation during testing. Please zoom in for details.
  • ...and 12 more figures