GaussianMotion: End-to-End Learning of Animatable Gaussian Avatars with Pose Guidance from Text
Gyumin Shim, Sangmin Lee, Jaegul Choo
TL;DR
GaussianMotion addresses the challenge of generating animatable 3D human avatars from textual descriptions by integrating deformable Gaussian Splatting with pose-guided score distillation. It densely samples random poses during training and introduces Adaptive Score Distillation to balance realistic detail and smoothness, enabling high-fidelity, pose-consistent renderings across arbitrary motions. The method learns residual skinning weights on top of SMPL, allows pose-conditioned distillation via ControlNet, and uses a scale regularization to maintain geometric detail. Experimental results show superior texture and geometry quality in static and animated scenarios, with strong quantitative and user-study support. This work offers a scalable, efficient pathway to text-driven, animatable 3D avatars for VR, metaverse, and related applications.
Abstract
In this paper, we introduce GaussianMotion, a novel human rendering model that generates fully animatable scenes aligned with textual descriptions using Gaussian Splatting. Although existing methods achieve reasonable text-to-3D generation of human bodies using various 3D representations, they often face limitations in fidelity and efficiency, or primarily focus on static models with limited pose control. In contrast, our method generates fully animatable 3D avatars by combining deformable 3D Gaussian Splatting with text-to-3D score distillation, achieving high fidelity and efficient rendering for arbitrary poses. By densely generating diverse random poses during optimization, our deformable 3D human model learns to capture a wide range of natural motions distilled from a pose-conditioned diffusion model in an end-to-end manner. Furthermore, we propose Adaptive Score Distillation that effectively balances realistic detail and smoothness to achieve optimal 3D results. Experimental results demonstrate that our approach outperforms existing baselines by producing high-quality textures in both static and animated results, and by generating diverse 3D human models from various textual inputs.
