SCULPT: Shape-Conditioned Unpaired Learning of Pose-dependent Clothed and Textured Human Meshes
Soubhik Sanyal, Partha Ghosh, Jinlong Yang, Michael J. Black, Justus Thies, Timo Bolkart
TL;DR
SCULPT addresses the challenge of generating realistic, clothed, textured 3D human meshes from unpaired data by decoupling geometry and appearance learning: geometry is learned from 3D scan data as pose-dependent UV-space displacements on the SMPL template, while texture is learned from large-scale 2D fashion images using a geometry-conditioned texture generator. The model comprises two StyleGAN3-based networks, G_geo for geometry and G_tex for texture, with clothing-type and color labels derived automatically via BLIP and CLIP to reduce pose-clothing entanglement, and G_tex further conditioned on intermediate features from G_geo to enforce geometry–texture coherence. Training proceeds in two stages with a dual-discriminator setup and differentiable rendering, producing explicit meshes and texture maps that are articulation-ready and suitable for standard graphics pipelines. Compared to state-of-the-art 3D clothed-human methods, SCULPT offers improved geometry quality, richer texture details, and explicit semantic controls over clothing type and color, enabling flexible, render-ready 3D human generation for games, films, and synthetic data creation.
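The geometry representation summarized above, displacements stored in UV space on the SMPL template, can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the bilinear sampling rule, the UV orientation (v indexes rows from the top), and the choice to add offsets directly to unposed template vertices are all assumptions made for illustration, and the posing or skinning step is omitted.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
UV = Tuple[float, float]
# A displacement map is an H x W grid of 3D offsets (hypothetical layout).
DispMap = Sequence[Sequence[Vec3]]


def sample_uv_map(disp_map: DispMap, u: float, v: float) -> Vec3:
    """Bilinearly sample a 3D displacement at UV coordinates in [0, 1]^2."""
    height, width = len(disp_map), len(disp_map[0])
    # Clamp to the valid range, then map to continuous pixel coordinates.
    x = min(max(u, 0.0), 1.0) * (width - 1)
    y = min(max(v, 0.0), 1.0) * (height - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, width - 1), min(y0 + 1, height - 1)
    fx, fy = x - x0, y - y0
    out = []
    for c in range(3):
        top = disp_map[y0][x0][c] * (1 - fx) + disp_map[y0][x1][c] * fx
        bottom = disp_map[y1][x0][c] * (1 - fx) + disp_map[y1][x1][c] * fx
        out.append(top * (1 - fy) + bottom * fy)
    return (out[0], out[1], out[2])


def displace_template(
    template_vertices: Sequence[Vec3],
    vertex_uvs: Sequence[UV],
    disp_map: DispMap,
) -> List[Vec3]:
    """Add the sampled UV-space displacement to each template vertex."""
    clothed = []
    for (px, py, pz), (u, v) in zip(template_vertices, vertex_uvs):
        dx, dy, dz = sample_uv_map(disp_map, u, v)
        clothed.append((px + dx, py + dy, pz + dz))
    return clothed


# Toy example: a 2 x 2 displacement map and two vertices.
toy_map = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0)],
]
toy_vertices = [(1.0, 2.0, 3.0), (0.0, 0.0, 0.0)]
toy_uvs = [(0.5, 0.5), (1.0, 0.0)]
clothed_vertices = displace_template(toy_vertices, toy_uvs, toy_map)
```

Storing displacements as a UV image lets an image-generating network output geometry, while the mesh keeps the template's topology, which is what allows the result to be articulated like the underlying body model.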
Abstract
We present SCULPT, a novel 3D generative model for clothed and textured 3D meshes of humans. Specifically, we devise a deep neural network that learns to represent the geometry and appearance distribution of clothed human bodies. Training such a model is challenging, as datasets of textured 3D meshes for humans are limited in size and accessibility. Our key observation is that medium-sized 3D scan datasets like CAPE exist alongside large-scale 2D image datasets of clothed humans, and that multiple appearances can be mapped to a single geometry. To learn effectively from the two data modalities, we propose an unpaired learning procedure for pose-dependent clothed and textured human meshes. First, we learn a pose-dependent geometry space from 3D scan data, which we represent as per-vertex displacements with respect to the SMPL model. Next, we train a geometry-conditioned texture generator in an unsupervised way using the 2D image data, conditioning it on intermediate activations of the learned geometry model. To alleviate entanglement between pose and clothing type, and between pose and clothing appearance, we condition both generators on attribute labels: clothing types for the geometry generator and clothing colors for the texture generator. We automatically generate these conditioning labels for the 2D images using the visual question answering model BLIP and CLIP. We validate our method on the SCULPT dataset and compare it to state-of-the-art 3D generative models for clothed human bodies. Our code and data can be found at https://sculpt.is.tue.mpg.de.
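The abstract does not spell out how the conditioning labels are derived from BLIP and CLIP. As a rough illustration only, the sketch below shows one common way a CLIP-style model can assign a label: compare an image embedding with the embeddings of candidate text prompts and keep the most similar one. The embeddings here are hand-written toy vectors rather than model outputs, and the label names and function names are hypothetical.

```python
import math
from typing import Dict, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pick_label(
    image_embedding: Sequence[float],
    label_embeddings: Dict[str, Sequence[float]],
) -> str:
    """Return the candidate label whose embedding best matches the image."""
    return max(
        label_embeddings,
        key=lambda name: cosine_similarity(image_embedding, label_embeddings[name]),
    )


# Toy 2D embeddings standing in for text-prompt embeddings (assumed values).
candidate_labels = {
    "short-sleeve top": [1.0, 0.0],
    "long-sleeve top": [0.0, 1.0],
}
toy_image_embedding = [0.9, 0.1]
chosen_label = pick_label(toy_image_embedding, candidate_labels)
```

In a real pipeline the vectors would come from a pretrained image and text encoder, and the chosen label would then be fed to the generator as a conditioning attribute.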
