RealityAvatar: Towards Realistic Loose Clothing Modeling in Animatable 3D Gaussian Avatars

Yahui Li, Zhi Zeng, Liming Pang, Guixuan Zhang, Shuwu Zhang

TL;DR

RealityAvatar tackles the challenge of accurately modeling dynamic loose clothing on animatable 3D human avatars. It introduces a 3D Gaussian Splatting framework augmented with a motion trend module (LSTM-based temporal modeling) and a latentbone encoder (region-based pose encoding with a clothing latent code) to capture pose-dependent deformations and temporal variations. The method maps canonical Gaussians to observation space via a skinning-based transformation and renders with a differentiable Gaussian pipeline that incorporates motion cues and per-frame lighting changes. Extensive experiments on I3D-Human and ZJU-Mocap show state-of-the-art novel-view and novel-pose performance, improved temporal coherence, and better handling of non-rigid clothing dynamics, with competitive training efficiency. Overall, RealityAvatar advances high-fidelity, temporally consistent digital humans with loose clothing for applications in virtual production and real-time avatars.
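
The skinning-based mapping from canonical to observation space can be illustrated with standard linear blend skinning. The sketch below is a minimal NumPy illustration, not the authors' implementation: the function name `skin_gaussians`, the array shapes, and the choice to transform each covariance with the blended linear part are assumptions made for the example, and the motion cues that RealityAvatar adds on top are omitted.

```python
import numpy as np

def skin_gaussians(means, covs, weights, bone_transforms):
    """Move canonical Gaussians to observation space with linear blend skinning.

    means:           (N, 3)    canonical Gaussian centers
    covs:            (N, 3, 3) canonical Gaussian covariances
    weights:         (N, B)    skinning weights, each row sums to 1
    bone_transforms: (B, 4, 4) per-bone rigid transforms for the target pose
    """
    # Blend the per-bone 4x4 transforms using each Gaussian's skinning weights.
    blended = np.einsum("nb,bij->nij", weights, bone_transforms)  # (N, 4, 4)
    linear = blended[:, :3, :3]       # blended linear (rotation-like) part
    translation = blended[:, :3, 3]   # blended translation part

    # Centers transform as points, covariances as A @ Sigma @ A^T.
    new_means = np.einsum("nij,nj->ni", linear, means) + translation
    new_covs = linear @ covs @ np.transpose(linear, (0, 2, 1))
    return new_means, new_covs


# Hypothetical two-bone example: bone 0 stays fixed, bone 1 rotates 90 degrees
# about the z-axis and shifts up by one unit.
rot_z90 = np.array([[0.0, -1.0, 0.0],
                    [1.0,  0.0, 0.0],
                    [0.0,  0.0, 1.0]])
bones = np.stack([np.eye(4), np.eye(4)])
bones[1, :3, :3] = rot_z90
bones[1, :3, 3] = [0.0, 0.0, 1.0]

means = np.array([[1.0, 0.0, 0.0]])
covs = np.diag([4.0, 1.0, 1.0])[None]   # elongated along x
weights = np.array([[0.0, 1.0]])        # fully bound to bone 1

posed_means, posed_covs = skin_gaussians(means, covs, weights, bones)
print(posed_means)   # center moves to (0, 1, 1)
print(posed_covs)    # long axis now points along y
```

A Gaussian bound to a single bone follows that bone rigidly. With mixed weights the blended linear part is no longer a pure rotation, which is one reason pose-dependent corrections are learned on top of plain skinning.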

Abstract

Modeling animatable human avatars from monocular or multi-view videos has been widely studied, with recent approaches leveraging neural radiance fields (NeRFs) or 3D Gaussian Splatting (3DGS) achieving impressive results in novel-view and novel-pose synthesis. However, existing methods often struggle to accurately capture the dynamics of loose clothing, as they primarily rely on global pose conditioning or static per-frame representations, leading to oversmoothing and temporal inconsistencies in non-rigid regions. To address this, we propose RealityAvatar, an efficient framework for high-fidelity digital human modeling, specifically targeting loosely dressed avatars. Our method leverages 3D Gaussian Splatting to capture complex clothing deformations and motion dynamics while ensuring geometric consistency. By incorporating a motion trend module and a latentbone encoder, we explicitly model pose-dependent deformations and temporal variations in clothing behavior. Extensive experiments on benchmark datasets demonstrate the effectiveness of our approach in capturing fine-grained clothing deformations and motion-driven shape variations. Our method significantly enhances structural fidelity and perceptual quality in dynamic human reconstruction, particularly in non-rigid regions, while achieving better consistency across temporal frames.

Paper Structure

This paper contains 14 sections, 12 equations, 3 figures, 3 tables.

Figures (3)

  • Figure 1: The overall pipeline of our method. The human body is initialized in canonical space based on the SMPL model. Using the motion trend module and human transformation, we deform the 3D Gaussian representation from canonical space to observation space. To extract meaningful pose features, we introduce the latentbone encoder, which captures skeletal representations that influence clothing dynamics. In the motion trend module, a recurrent neural network is employed to encode temporal dependencies, modeling different dynamic effects of clothing under similar poses. These components work together to achieve realistic motion-driven human reconstruction.
  • Figure 2: Qualitative comparison on the I3D-Human dataset. Top: novel-view comparison. Bottom: novel-pose comparison. For novel-view synthesis, some of the baseline result images are taken from Dyco [chen2024within]. For the novel-pose comparison, we focus on Dyco [chen2024within], as this method also specializes in modeling and driving human bodies in loose clothing.
  • Figure 3: Ablation Study on I3D-Human Dataset.
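
The Figure 1 caption notes that a recurrent network lets the model produce different clothing dynamics under similar poses. The sketch below illustrates that idea with a tiny, untrained LSTM written in NumPy. The class name `TinyLSTM`, the 3-dimensional pose vectors, the hidden size, and the random weights are all hypothetical choices for the example and do not reflect the paper's actual architecture.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTM:
    """Minimal single-layer LSTM with fixed random weights (illustration only)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)
        # One stacked matrix holds the input, forget, candidate and output gates.
        self.W = rng.normal(0.0, 0.5, size=(4 * hidden_dim, input_dim + hidden_dim))
        self.b = np.zeros(4 * hidden_dim)
        self.hidden_dim = hidden_dim

    def encode(self, pose_sequence):
        """Return the final hidden state after reading the whole pose sequence."""
        h = np.zeros(self.hidden_dim)
        c = np.zeros(self.hidden_dim)
        for pose in pose_sequence:
            z = self.W @ np.concatenate([pose, h]) + self.b
            i, f, g, o = np.split(z, 4)
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)   # update cell memory
            h = sigmoid(o) * np.tanh(c)                    # expose gated memory
        return h


# Two hypothetical motions that end in exactly the same pose.
final_pose = np.array([0.3, -0.2, 0.1])
slow_approach = [np.zeros(3), 0.5 * final_pose, final_pose]
fast_overshoot = [2.0 * final_pose, 1.5 * final_pose, final_pose]

lstm = TinyLSTM(input_dim=3, hidden_dim=8)
feat_slow = lstm.encode(slow_approach)
feat_fast = lstm.encode(fast_overshoot)

# The last pose is identical, but the motion history differs, so the
# temporal features differ as well.
print(np.abs(feat_slow - feat_fast).max())
```

A purely per-frame pose encoder would return the same feature for both sequences, which is why history-aware features are useful for garments whose shape depends on how the body arrived at a pose.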