MonoCloth: Reconstruction and Animation of Cloth-Decoupled Human Avatars from Monocular Videos
Daisheng Jin, Ying He
TL;DR
MonoCloth reconstructs clothed human avatars from monocular videos by decoupling the body and clothing into specialized components and modeling garment dynamics with a dedicated CloSim module. It builds on SMPL-X with per-component 3D Gaussian representations and uses FLAME and MANO for fine-grained facial and hand detail, while CloSim combines a GCN and a GRU to capture the spatial and temporal structure of garment motion. The model is trained in two stages: it first learns broad priors across subjects, then adapts to a target identity. It achieves state-of-the-art reconstruction and animation quality on NeuMan and X-Humans and also supports clothing transfer. The approach enables realistic novel-view synthesis and practical applications such as virtual try-on from monocular video.
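To make the GCN plus GRU pairing concrete, the following is a minimal sketch of the general idea, not the CloSim architecture itself. A graph convolution mixes each garment vertex with its mesh neighbors (spatial), and a GRU carries a per-vertex hidden state across frames (temporal). Everything here is an assumption for illustration: the function names, the 3D per-vertex input features, the hidden size, the bias-free GRU variant, the linear readout to 3D offsets, and the random, untrained weights.

```python
import math
import random


def matmul(x, y):
    """Dense matrix product for list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def elementwise(fn, *mats):
    """Apply fn entry by entry across same-shaped matrices."""
    return [[fn(*vals) for vals in zip(*rows)] for rows in zip(*mats)]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def normalized_adjacency(num_vertices, edges):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of the garment mesh graph."""
    a = [[float(i == j) for j in range(num_vertices)] for i in range(num_vertices)]
    for i, j in edges:
        a[i][j] = a[j][i] = 1.0
    deg = [sum(row) for row in a]
    return [[a[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(num_vertices)]
            for i in range(num_vertices)]


def gcn_layer(a_hat, feats, weight):
    """Spatial step: ReLU(A_hat X W) mixes each vertex with its mesh neighbors."""
    return elementwise(lambda v: max(0.0, v), matmul(matmul(a_hat, feats), weight))


def gru_step(x, h, p):
    """Temporal step: bias-free GRU update applied to every vertex with shared weights."""
    z = elementwise(lambda a, b: sigmoid(a + b), matmul(x, p["Wz"]), matmul(h, p["Uz"]))
    r = elementwise(lambda a, b: sigmoid(a + b), matmul(x, p["Wr"]), matmul(h, p["Ur"]))
    rh = elementwise(lambda a, b: a * b, r, h)
    n = elementwise(lambda a, b: math.tanh(a + b), matmul(x, p["Wn"]), matmul(rh, p["Un"]))
    return elementwise(lambda zi, ni, hi: (1.0 - zi) * ni + zi * hi, z, n, h)


def random_matrix(rng, rows, cols):
    """Untrained placeholder weights; a real model would learn these."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def predict_offsets(frames, edges, hidden=4, seed=0):
    """Map a sequence of per-vertex 3D inputs to per-vertex 3D offsets, frame by frame."""
    rng = random.Random(seed)
    num_vertices = len(frames[0])
    a_hat = normalized_adjacency(num_vertices, edges)
    w_gcn = random_matrix(rng, 3, hidden)
    p = {k: random_matrix(rng, hidden, hidden)
         for k in ("Wz", "Uz", "Wr", "Ur", "Wn", "Un")}
    w_out = random_matrix(rng, hidden, 3)
    h = [[0.0] * hidden for _ in range(num_vertices)]  # recurrent state per vertex
    offsets = []
    for feats in frames:
        h = gru_step(gcn_layer(a_hat, feats, w_gcn), h, p)
        offsets.append(matmul(h, w_out))  # hypothetical linear readout
    return offsets


# Toy example: a 3-vertex strip of cloth observed over 2 frames.
edges = [(0, 1), (1, 2)]
frames = [
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]],
    [[0.0, 0.1, 0.0], [1.0, 0.2, 0.0], [2.0, 0.1, 0.0]],
]
offsets = predict_offsets(frames, edges)
print(len(offsets), len(offsets[0]), len(offsets[0][0]))  # prints: 2 3 3
```

Because the hidden state persists across the loop, the offsets for a frame depend on earlier frames as well as on the current one, which is the property a recurrent unit contributes on top of the per-frame graph convolution.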
Abstract
Reconstructing realistic 3D human avatars from monocular videos is a challenging task due to the limited geometric information and complex non-rigid motion involved. We present MonoCloth, a new method for reconstructing and animating clothed human avatars from monocular videos. To overcome the limitations of monocular input, we introduce a part-based decomposition strategy that separates the avatar into body, face, hands, and clothing. This design reflects the varying levels of reconstruction difficulty and deformation complexity across these components. Specifically, we focus on detailed geometry recovery for the face and hands. For clothing, we propose a dedicated cloth simulation module that captures garment deformation using temporal motion cues and geometric constraints. Experimental results demonstrate that MonoCloth improves both visual reconstruction quality and animation realism compared to existing methods. Furthermore, thanks to its part-based design, MonoCloth also supports additional tasks such as clothing transfer, underscoring its versatility and practical utility.
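As a rough illustration of why a part-based design lends itself to clothing transfer, the sketch below stores each part as a separate component so that a transfer reduces to swapping one field. This is a simplified assumption, not the method described above: the `Component` and `Avatar` types and the `transfer_clothing` function are hypothetical, each component keeps only Gaussian centers, and the sketch ignores any refitting of the garment to the new body shape.

```python
from dataclasses import dataclass, replace
from typing import Tuple

Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Component:
    """Hypothetical stand-in for one part's set of 3D Gaussians (centers only)."""
    name: str
    centers: Tuple[Point, ...]


@dataclass(frozen=True)
class Avatar:
    """An avatar decomposed into the four parts named in the abstract."""
    body: Component
    face: Component
    hands: Component
    clothing: Component


def transfer_clothing(source: Avatar, target: Avatar) -> Avatar:
    """Return a copy of target that wears source's clothing component."""
    return replace(target, clothing=source.clothing)


def make_avatar(tag: str) -> Avatar:
    """Build a toy avatar with one placeholder Gaussian per part."""
    origin = ((0.0, 0.0, 0.0),)
    return Avatar(
        body=Component(f"{tag}-body", origin),
        face=Component(f"{tag}-face", origin),
        hands=Component(f"{tag}-hands", origin),
        clothing=Component(f"{tag}-clothing", origin),
    )


alice = make_avatar("alice")
bob = make_avatar("bob")
bob_in_alice_outfit = transfer_clothing(source=alice, target=bob)
print(bob_in_alice_outfit.body.name, bob_in_alice_outfit.clothing.name)
# prints: bob-body alice-clothing
```

The frozen dataclasses keep the original avatars unchanged, so the same source garment can be reused across several targets.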
