SiTH: Single-view Textured Human Reconstruction with Image-Conditioned Diffusion

Hsuan-I Ho, Jie Song, Otmar Hilliges

TL;DR

SiTH is a two-stage framework for single-view textured human reconstruction. It first hallucinates a back-view image from the front-view input using an image-conditioned diffusion model, then reconstructs a full-body textured mesh guided by both views and a skinned body prior. Trained on roughly 500 THuman2.0 scans and evaluated on CustomHumans, a newly introduced higher-quality benchmark, SiTH produces perceptually realistic back-view details and accurate geometry in under two minutes, outperforming several optimization- and diffusion-based baselines. The approach integrates conditioning signals from CLIP and VAE, UV maps, and silhouette masks, and uses pixel-aligned features with a local body prior to resolve depth ambiguity in reconstruction. Its robustness to unseen inputs and compatibility with generative diffusion workflows enable fast 3D human creation from ordinary images, with potential for AI-assisted content generation.

Abstract

A long-standing goal of 3D human reconstruction is to create lifelike and fully detailed 3D humans from single-view images. The main challenge lies in inferring unknown body shapes, appearances, and clothing details in areas not visible in the images. To address this, we propose SiTH, a novel pipeline that uniquely integrates an image-conditioned diffusion model into a 3D mesh reconstruction workflow. At the core of our method lies the decomposition of the challenging single-view reconstruction problem into generative hallucination and reconstruction subproblems. For the former, we employ a powerful generative diffusion model to hallucinate unseen back-view appearance based on the input images. For the latter, we leverage skinned body meshes as guidance to recover full-body texture meshes from the input and back-view images. SiTH requires as few as 500 3D human scans for training while maintaining its generality and robustness to diverse images. Extensive evaluations on two 3D human benchmarks, including our newly created one, highlighted our method's superior accuracy and perceptual quality in 3D textured human reconstruction. Our code and evaluation benchmark are available at https://ait.ethz.ch/sith

Paper Structure

This paper contains 51 sections, 8 equations, 22 figures, and 8 tables.

Figures (22)

  • Figure 1: Method overview. SiTH is a two-stage pipeline composed of back-view hallucination and mesh reconstruction. The back-view hallucination module samples perceptually consistent back-view images through an iterative denoising process conditioned on the input image, UV map, and silhouette mask. Based on the input and generated back-view images, the mesh reconstruction module recovers a full-body mesh and textures, leveraging a skinned body prior as guidance. Note that both modules in the pipeline can be trained with the same public 3D human dataset and generalize to unseen images.
  • Figure 2: Training of the back-view hallucination module. We employ a pretrained LDM and a ControlNet architecture to enable image conditioning. To train our model, we render training pairs of conditional images $I^F$ and ground-truth images $I^B$ from 3D human scans. Given a noisy image latent $z_t$, the model predicts the added noise $\epsilon$, conditioned on the image $I^F$, UV map $I^B_{UV}$, and mask $I^B_{M}$. We train the ControlNet model and cross-attention layers while keeping other parameters frozen.
  • Figure 3: Mesh reconstruction module. Given front- and back-view images ($I^F, I^B$), we predict their normal images ($N^F, N^B$) with a learned normal predictor. A 3D point $\mathbf{x}$ is projected onto these images to query pixel-aligned features ($\mathbf{f}_{d,x}, \mathbf{f}_{r,x}$). To leverage the human body mesh as guidance, we embed the point $\mathbf{x}$ into local UV coordinates $\mathbf{u}_c$, vector $\mathbf{n}_c$, distance $d_c$, and visibility $v_c$. Finally, two decoders ($H_d, H_r$) predict SDF and RGB values at $\mathbf{x}$ given the positional embedding and pixel-aligned features.
  • Figure 4: Qualitative comparison on CustomHumans. Top: Results of methods generating mesh and texture. Bottom: Results of methods generating mesh only. Note that single-view reconstruction cannot replicate the exact back-view texture and geometry. Our method generates realistic texture and clothing wrinkles that are perceptually close to the real scans, while other baselines only produce smooth colors and surfaces in the back regions. Best viewed in color and zoomed in.
  • Figure 5: Qualitative comparison of back-view hallucination. We visualize back-view images generated by the baseline methods. Note that the three images are sampled with different random seeds. Our results are perceptually close to the ground-truth image in terms of appearance and pose. Moreover, our method preserves generative stochasticity, which shows up as small changes in wrinkles.
  • ...and 17 more figures
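
The Figure 3 caption describes projecting a 3D point onto the front- and back-view images to query pixel-aligned features. The sketch below illustrates only that lookup step, under assumptions that are not stated on this page: an orthographic camera along the z-axis, point coordinates normalized to [-1, 1], a back view that mirrors the front view horizontally, and bilinear interpolation. The function names (`project_orthographic`, `bilinear_sample`, `query_pixel_aligned`) are hypothetical, and the feature maps here are plain nested lists standing in for the outputs of learned image encoders.

```python
from typing import List, Sequence, Tuple

# A feature map indexed as [row][col][channel]; a stand-in for encoder output.
FeatureMap = List[List[List[float]]]


def project_orthographic(point: Sequence[float], width: int, height: int,
                         back_view: bool = False) -> Tuple[float, float]:
    """Map a 3D point with x, y in [-1, 1] to continuous (col, row) pixels.

    Assumption: orthographic camera along z, so depth is ignored here.
    Assumption: the back view sees the point mirrored horizontally.
    """
    x, y, _z = point
    if back_view:
        x = -x
    col = (x + 1.0) * 0.5 * (width - 1)
    row = (1.0 - y) * 0.5 * (height - 1)  # image rows grow downward
    return col, row


def bilinear_sample(fmap: FeatureMap, col: float, row: float) -> List[float]:
    """Bilinearly interpolate the feature vector at a continuous pixel."""
    height, width = len(fmap), len(fmap[0])
    # Clamp to the image so points outside the frame reuse border features.
    col = min(max(col, 0.0), width - 1.0)
    row = min(max(row, 0.0), height - 1.0)
    c0, r0 = int(col), int(row)
    c1, r1 = min(c0 + 1, width - 1), min(r0 + 1, height - 1)
    wc, wr = col - c0, row - r0
    return [
        (1 - wr) * ((1 - wc) * a + wc * b) + wr * ((1 - wc) * c + wc * d)
        for a, b, c, d in zip(fmap[r0][c0], fmap[r0][c1],
                              fmap[r1][c0], fmap[r1][c1])
    ]


def query_pixel_aligned(point: Sequence[float], front_map: FeatureMap,
                        back_map: FeatureMap) -> List[float]:
    """Concatenate the features sampled from the front and back views."""
    height, width = len(front_map), len(front_map[0])
    front = bilinear_sample(
        front_map, *project_orthographic(point, width, height))
    back = bilinear_sample(
        back_map, *project_orthographic(point, width, height, back_view=True))
    return front + back


if __name__ == "__main__":
    front = [[[0.0], [1.0]], [[2.0], [3.0]]]
    back = [[[10.0], [20.0]], [[30.0], [40.0]]]
    print(query_pixel_aligned((0.0, 0.0, 0.0), front, back))
```

Because the projection discards depth, every point along the same ray receives identical image features; this is the depth ambiguity that the body-mesh embedding ($\mathbf{u}_c$, $\mathbf{n}_c$, $d_c$, $v_c$) in the caption is meant to help resolve.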