MonoCloth: Reconstruction and Animation of Cloth-Decoupled Human Avatars from Monocular Videos

Daisheng Jin, Ying He

TL;DR

MonoCloth reconstructs clothed human avatars from monocular videos by decoupling body and clothing into specialized components and modeling garment dynamics with a dedicated module, CloSim. It builds on SMPL-X with per-component 3D Gaussian representations and uses FLAME and MANO for fine-grained facial and hand detail, while CloSim combines a GCN and a GRU to capture spatial and temporal garment motion. Training proceeds in two stages: the first learns broad priors across subjects, and the second adapts to a target identity. The authors report state-of-the-art reconstruction and animation quality on NeuMan and X-Humans, and the decoupled design also supports clothing transfer. The approach enables realistic novel-view synthesis and practical applications such as virtual try-on from monocular video.
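To make the GCN-plus-GRU idea concrete, the following is a minimal sketch in plain Python, not the authors' CloSim implementation. It assumes a single graph-convolution layer with a row-normalized adjacency matrix, a standard GRU cell shared across vertices, and a linear head that outputs a 3D offset per garment vertex. All function names, layer sizes, and random weights are hypothetical.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalized_adjacency(edges, n):
    """Row-normalized adjacency with self-loops (assumed propagation matrix)."""
    adj = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    return [[v / sum(row) for v in row] for row in adj]

def gcn_layer(a_hat, feats, weight):
    """One graph convolution: tanh(A_hat @ X @ W), mixing neighbor features."""
    mixed = matmul(matmul(a_hat, feats), weight)
    return [[math.tanh(v) for v in row] for row in mixed]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, p):
    """Standard GRU update for one vertex; x and h are 1 x d row vectors."""
    xh = [x[0] + h[0]]  # concatenate input and previous hidden state
    z = [sigmoid(v) for v in matmul(xh, p["Wz"])[0]]  # update gate
    r = [sigmoid(v) for v in matmul(xh, p["Wr"])[0]]  # reset gate
    xrh = [x[0] + [ri * hi for ri, hi in zip(r, h[0])]]
    cand = [math.tanh(v) for v in matmul(xrh, p["Wh"])[0]]  # candidate state
    return [[(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h[0], cand)]]

def predict_offsets(frames, edges, params):
    """frames: list over time of n x d_in vertex features. Returns n x 3 offsets."""
    n = len(frames[0])
    a_hat = normalized_adjacency(edges, n)
    hidden = [[[0.0] * params["d_hid"]] for _ in range(n)]
    for feats in frames:
        spatial = gcn_layer(a_hat, feats, params["Wg"])  # spatial mixing
        hidden = [gru_step([spatial[v]], hidden[v], params) for v in range(n)]  # temporal update
    return [matmul(hidden[v], params["Wo"])[0] for v in range(n)]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

# Toy usage: 4 garment vertices on a chain, 5 frames of 4-D features.
rng = random.Random(0)
d_in, d_hid = 4, 8
params = {
    "d_hid": d_hid,
    "Wg": rand_matrix(d_in, d_hid, rng),
    "Wz": rand_matrix(2 * d_hid, d_hid, rng),
    "Wr": rand_matrix(2 * d_hid, d_hid, rng),
    "Wh": rand_matrix(2 * d_hid, d_hid, rng),
    "Wo": rand_matrix(d_hid, 3, rng),
}
edges = [(0, 1), (1, 2), (2, 3)]
frames = [rand_matrix(4, d_in, rng) for _ in range(5)]
offsets = predict_offsets(frames, edges, params)
```

In this sketch the graph layer shares information between neighboring garment vertices within a frame, and the recurrent cell carries state across frames, which is one plausible reading of "spatial and temporal garment motion". A practical version would use a tensor library and learned weights.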

Abstract

Reconstructing realistic 3D human avatars from monocular videos is a challenging task due to the limited geometric information and complex non-rigid motion involved. We present MonoCloth, a new method for reconstructing and animating clothed human avatars from monocular videos. To overcome the limitations of monocular input, we introduce a part-based decomposition strategy that separates the avatar into body, face, hands, and clothing. This design reflects the varying levels of reconstruction difficulty and deformation complexity across these components. Specifically, we focus on detailed geometry recovery for the face and hands. For clothing, we propose a dedicated cloth simulation module that captures garment deformation using temporal motion cues and geometric constraints. Experimental results demonstrate that MonoCloth improves both visual reconstruction quality and animation realism compared to existing methods. Furthermore, thanks to its part-based design, MonoCloth also supports additional tasks such as clothing transfer, underscoring its versatility and practical utility.
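As a rough illustration of why a part-based decomposition makes clothing transfer straightforward, the sketch below groups Gaussians by a part label and swaps the clothing group between two avatars. The `Avatar` container, the `"part"` and `"id"` fields, and both function names are hypothetical and not taken from the paper. A real system would also need to re-fit the garment to the target body shape and pose.

```python
from dataclasses import dataclass, field

PARTS = ("body", "face", "hands", "clothing")

@dataclass
class Avatar:
    # Maps each part name to a list of Gaussian records (plain dicts here).
    parts: dict = field(default_factory=lambda: {p: [] for p in PARTS})

def decompose(gaussians):
    """Group Gaussians into parts using a per-Gaussian 'part' label."""
    avatar = Avatar()
    for g in gaussians:
        avatar.parts[g["part"]].append(g)
    return avatar

def transfer_clothing(source, target):
    """Return a new avatar: target's body, face, and hands with source's clothing."""
    new_parts = {p: list(target.parts[p]) for p in PARTS}
    new_parts["clothing"] = list(source.parts["clothing"])
    return Avatar(parts=new_parts)

# Toy usage with two tiny avatars.
alice = decompose([{"part": "body", "id": 1}, {"part": "clothing", "id": 2}])
bob = decompose([
    {"part": "body", "id": 3},
    {"part": "clothing", "id": 4},
    {"part": "face", "id": 5},
])
bob_in_alice_clothes = transfer_clothing(source=alice, target=bob)
```

Because each part lives in its own group, the swap touches only the clothing entry and leaves both input avatars unchanged.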

Paper Structure

This paper contains 16 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: MonoCloth reconstructs a human avatar from monocular videos by dividing it into separate components. Each component is optimized using a strategy suited to its geometric and motion characteristics, which improves reconstruction quality. The reconstructed avatar supports natural animation and can be rendered from novel viewpoints. The modular design also allows for part-level editing, such as clothing transfer.
  • Figure 2: MonoCloth pipeline. 1) We first reconstruct the static geometry and appearance in the canonical pose, where Gaussian attributes are computed and decomposed into different components. 2) Combining static avatar features with multi-frame SMPL-X parameters, we incorporate both spatial and temporal information to predict motion-dependent offsets that enrich avatar details. 3) The reconstructed avatar is supervised using ground-truth RGB images, normal maps, depth maps, and additional auxiliary targets to jointly optimize appearance and geometry.
  • Figure 3: Qualitative results on NeuMan. Our method achieves the highest overall visual quality, particularly in reconstructing fine clothing textures as well as facial and hand details.
  • Figure 4: X-Humans comparison. Temporal modeling improves clothing stability.
  • Figure 5: Geometry loss ablation. Geometry supervision reduces 3D artifacts.
  • ...and 1 more figure
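The supervision step in the Figure 2 caption combines several image-space targets. The sketch below shows one common way to combine them: a weighted sum of an L1 RGB term, a cosine-based normal term, and an L1 depth term. The specific loss forms, the weights, and all names are assumptions for illustration. The source does not state the loss formulation, and the auxiliary targets it mentions are omitted here.

```python
import math

def l1(pred, gt):
    """Mean absolute error over flat lists of values."""
    return sum(abs(p - g) for p, g in zip(pred, gt)) / len(pred)

def normal_loss(pred, gt):
    """Mean (1 - cosine similarity) over per-pixel normals (assumed non-zero)."""
    total = 0.0
    for p, g in zip(pred, gt):
        dot = sum(x * y for x, y in zip(p, g))
        norm = math.sqrt(sum(x * x for x in p)) * math.sqrt(sum(y * y for y in g))
        total += 1.0 - dot / norm
    return total / len(pred)

def total_loss(pred, gt, weights):
    """Weighted sum of RGB, normal, and depth terms (weights are hypothetical)."""
    return (
        weights["rgb"] * l1(pred["rgb"], gt["rgb"])
        + weights["normal"] * normal_loss(pred["normal"], gt["normal"])
        + weights["depth"] * l1(pred["depth"], gt["depth"])
    )

# Toy usage: two RGB values, one normal, one depth value.
weights = {"rgb": 1.0, "normal": 0.1, "depth": 0.5}
gt = {"rgb": [0.0, 1.0], "normal": [[0.0, 0.0, 1.0]], "depth": [2.0]}
pred = {"rgb": [0.5, 0.5], "normal": [[0.0, 0.0, 1.0]], "depth": [1.0]}
loss = total_loss(pred, gt, weights)
```

Geometry terms such as the normal and depth losses constrain shape directly, which is consistent with the Figure 5 caption's note that geometry supervision reduces 3D artifacts.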