GaussianMotion: End-to-End Learning of Animatable Gaussian Avatars with Pose Guidance from Text

Gyumin Shim, Sangmin Lee, Jaegul Choo

TL;DR

GaussianMotion generates animatable 3D human avatars from text descriptions by combining deformable Gaussian Splatting with pose-guided score distillation. It densely samples random poses during training and introduces Adaptive Score Distillation to balance realistic detail against smoothness, yielding high-fidelity, pose-consistent renderings across arbitrary motions. The method learns residual skinning weights on top of SMPL, conditions the distillation on pose via ControlNet, and applies a scale regularization to preserve geometric detail. Experiments show better texture and geometry quality than the baselines in both static and animated settings, supported by quantitative metrics and a user study. The approach offers an efficient route to text-driven, animatable 3D avatars for VR, metaverse, and related applications.

Abstract

In this paper, we introduce GaussianMotion, a novel human rendering model that generates fully animatable scenes aligned with textual descriptions using Gaussian Splatting. Although existing methods achieve reasonable text-to-3D generation of human bodies using various 3D representations, they often face limitations in fidelity and efficiency, or primarily focus on static models with limited pose control. In contrast, our method generates fully animatable 3D avatars by combining deformable 3D Gaussian Splatting with text-to-3D score distillation, achieving high fidelity and efficient rendering for arbitrary poses. By densely generating diverse random poses during optimization, our deformable 3D human model learns to capture a wide range of natural motions distilled from a pose-conditioned diffusion model in an end-to-end manner. Furthermore, we propose Adaptive Score Distillation that effectively balances realistic detail and smoothness to achieve optimal 3D results. Experimental results demonstrate that our approach outperforms existing baselines by producing high-quality textures in both static and animated results, and by generating diverse 3D human models from various textual inputs.

Paper Structure

This paper contains 17 sections, 8 equations, 7 figures, 2 tables.

Figures (7)

  • Figure 1: Examples of 3D human models generated by GaussianMotion. Our method is able to generate high-quality Gaussian-based avatars from text and render animated scenes from user-specified pose inputs.
  • Figure 2: Overview of our proposed framework. Given a text prompt as input, we generate animatable 3D humans by modeling deformable Gaussian Splatting, where Gaussian points adapt their positions based on input poses. The points are defined in a canonical space and shared across different poses (observation spaces). Random poses are sampled to deform the Gaussian points and rendered as pose images to provide pose-aware guidance for the rendered images $\mathbf{x}$ through score distillation. After optimizing the Gaussian points to reflect the appearances described by the text prompt, fully animatable scenes are rendered based on user-specified input poses during inference.
  • Figure 3: Qualitative comparison of 3D human models in a static A-pose. We evaluate our approach against recent state-of-the-art baselines using different prompts. For each method, two images are rendered from frontal and side views, respectively.
  • Figure 4: Qualitative comparison of 3D human models in animated scenes. We evaluate our approach against recent state-of-the-art baselines in a one-to-one manner. For each method, four images are rendered in different poses corresponding to each text prompt.
  • Figure 5: Ablation studies on pose guidance. We present rendered images from 3D models trained with and without pose guidance, with the input pose shown in the first column. Additionally, we show generated images sampled from noised rendered images, with and without pose conditioning.
  • ...and 2 more figures
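
The Figure 2 caption describes Gaussian points that live in a canonical space and are moved to an observation space by the input pose, and the TL;DR mentions residual skinning weights learned on top of SMPL. The following is a minimal sketch of that idea using plain linear blend skinning, not the authors' implementation. The function names, the log-space combination of base weights and residual, and the two-joint toy setup are all assumptions made for illustration; the paper's actual parameterization may differ.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def blend_weights(base_weights, residual_logits, eps=1e-8):
    """Combine base skinning weights (e.g. taken from SMPL) with a learned residual.

    Assumption: the residual is added in log space and the result is
    renormalised with a softmax so the weights stay positive and sum to 1.
    """
    logits = [math.log(w + eps) + r for w, r in zip(base_weights, residual_logits)]
    return softmax(logits)

def deform_point(point, weights, joint_transforms):
    """Linear blend skinning: x' = sum_k w_k * (R_k x + t_k).

    `joint_transforms` is a list of (rotation 3x3, translation 3-vector) pairs,
    one per joint, mapping canonical space to the posed observation space.
    """
    out = [0.0, 0.0, 0.0]
    for w, (rot, trans) in zip(weights, joint_transforms):
        for i in range(3):
            moved = sum(rot[i][j] * point[j] for j in range(3)) + trans[i]
            out[i] += w * moved
    return out

# Toy setup (hypothetical): two joints, one fixed and one shifted up by 1 unit.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
transforms = [(IDENTITY, [0.0, 0.0, 0.0]), (IDENTITY, [0.0, 1.0, 0.0])]
canonical_point = [1.0, 0.0, 0.0]   # centre of one Gaussian in canonical space
base = [0.5, 0.5]                   # base skinning weights for this point

# With a zero residual the base weights are recovered unchanged.
w_plain = blend_weights(base, [0.0, 0.0])
posed_plain = deform_point(canonical_point, w_plain, transforms)

# A positive residual on joint 0 pulls the point towards that joint's motion.
w_learned = blend_weights(base, [2.0, 0.0])
posed_learned = deform_point(canonical_point, w_learned, transforms)
```

In this toy case the point with equal weights lands halfway between the two joint motions, while the residual shifts it towards the fixed joint. In a full pipeline the residual would be an optimizable parameter per Gaussian, updated by the score distillation gradient that flows back through the rendered image.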