Zero-Shot Reconstruction of Animatable 3D Avatars with Cloth Dynamics from a Single Image

Joohyun Kwon, Geonhee Sim, Gyeongsik Moon

Abstract

Existing single-image 3D human avatar methods primarily rely on rigid joint transformations, limiting their ability to model realistic cloth dynamics. We present DynaAvatar, a zero-shot framework that reconstructs animatable 3D human avatars with motion-dependent cloth dynamics from a single image. Trained on large-scale multi-person motion datasets, DynaAvatar employs a Transformer-based feed-forward architecture that directly predicts dynamic 3D Gaussian deformations without subject-specific optimization. To overcome the scarcity of dynamic captures, we introduce a static-to-dynamic knowledge transfer strategy: a Transformer pretrained on large-scale static captures provides strong geometric and appearance priors, which are efficiently adapted to motion-dependent deformations through lightweight LoRA fine-tuning on dynamic captures. We further propose the DynaFlow loss, an optical flow-guided objective that provides reliable motion-direction geometric cues for cloth dynamics in rendered space. Finally, we reannotate the missing or noisy SMPL-X fittings in existing dynamic capture datasets, as most public dynamic capture datasets contain incomplete or unreliable fittings that are unsuitable for training high-quality 3D avatar reconstruction models. Experiments demonstrate that DynaAvatar produces visually rich and generalizable animations, outperforming prior methods.
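The static-to-dynamic transfer described above relies on LoRA: the pretrained weights stay frozen and only a low-rank update is trained. The following is a minimal pure-Python sketch of the standard LoRA formulation, W + (alpha / r) * B A, and not the authors' implementation. The class name `LoRALinear`, the helper `matmul`, and the rank, scaling, and initialization values are illustrative assumptions; the abstract does not say which layers are adapted or what rank is used.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / rank) * B @ A.

    Hypothetical illustration: names and hyperparameters are assumptions,
    not taken from the paper.
    """

    def __init__(self, weight, rank, alpha=1.0, seed=0):
        rng = random.Random(seed)
        out_dim, in_dim = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weight, shape (out_dim, in_dim)
        # A gets a small random init, B starts at zero, so the adapted layer
        # initially reproduces the pretrained layer exactly.
        self.a = [[rng.gauss(0.0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
        self.b = [[0.0] * rank for _ in range(out_dim)]
        self.scale = alpha / rank

    def effective_weight(self):
        """Return W + scale * (B @ A)."""
        delta = matmul(self.b, self.a)
        return [
            [w + self.scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(self.weight, delta)
        ]

    def forward(self, x):
        """Apply the adapted layer to an input vector x (list of floats)."""
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count the parameters in A and B, the only ones that would be trained."""
        return sum(len(r) for r in self.a) + sum(len(r) for r in self.b)
```

For a layer with `out_dim` outputs and `in_dim` inputs, the adapter holds `rank * (in_dim + out_dim)` trainable values instead of `out_dim * in_dim`, which is why this kind of adaptation suits a setting where dynamic captures are scarce.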

Paper Structure (28 sections, 13 figures, 6 tables)

Figures (13)

  • Figure 1: Comparison between LHM [qiu2025LHM] and our DynaAvatar on both mild and fast motions. Unlike prior single-image-based methods, DynaAvatar can reconstruct animatable 3D human avatars that exhibit motion-dependent cloth dynamics.
  • Figure 2: Overall pipeline of the proposed DynaAvatar. We first extract detailed geometry and appearance without cloth dynamics using a Static Transformer. Next, cloth dynamics are incorporated from motion history through a Dynamic Transformer. The final 3D avatar in canonical space is reconstructed using a Gaussian decoder and then animated and rendered with LBS and a 3DGS renderer. Since the canonical avatar already encodes motion-dependent cloth dynamics, the animation produced by LBS faithfully maintains these dynamics.
  • Figure 3: Visualization of the proposed DynaFlow loss. Our DynaFlow loss encourages the Gaussians at source locations (black-outlined white circles) to move toward the endpoints of the estimated flow vectors.
  • Figure 4: Comparison between (b) the original annotations and (c) our reannotations for the DNA-Rendering [cheng2023dna] (top) and Actors-HQ [icsik2023humanrf] (bottom) datasets.
  • Figure 5: Effectiveness of our Dynamic Transformer.
  • ...and 8 more figures
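
The Figure 3 caption describes the DynaFlow loss as pulling Gaussians at source locations toward the endpoints of estimated flow vectors. The sketch below shows one simple objective of that kind, assuming each Gaussian has a projected 2D center in the source frame and in the next rendered frame, and that an optical-flow vector has already been sampled at each source location. The function name `flow_endpoint_loss` and the mean-L2 form are assumptions for illustration; the exact loss used in the paper is not given in this excerpt.

```python
import math


def flow_endpoint_loss(src_xy, pred_xy, flow_xy):
    """Mean L2 distance between predicted 2D positions and flow endpoints.

    src_xy:  projected Gaussian centers in the source frame, [(x, y), ...]
    pred_xy: projected centers of the same Gaussians in the next frame
    flow_xy: optical-flow vectors sampled at the source locations

    Hypothetical illustration of a flow-guided objective, not the paper's
    definition of the DynaFlow loss.
    """
    if not (len(src_xy) == len(pred_xy) == len(flow_xy)):
        raise ValueError("src_xy, pred_xy, and flow_xy must have the same length")
    if not src_xy:
        return 0.0

    total = 0.0
    for (sx, sy), (px, py), (fx, fy) in zip(src_xy, pred_xy, flow_xy):
        # Endpoint of the flow vector anchored at the source location.
        tx, ty = sx + fx, sy + fy
        total += math.hypot(px - tx, py - ty)
    return total / len(src_xy)
```

The loss is zero when every Gaussian lands exactly on its flow endpoint and grows with the distance from it, so a Gaussian that stays put while the flow indicates motion is penalized by the length of its flow vector.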