$E^{3}$Gen: Efficient, Expressive and Editable Avatars Generation

Weitian Zhang, Yichao Yan, Yunhui Liu, Xingdong Sheng, Xiaokang Yang

TL;DR

$E^3$Gen introduces a generative UV features plane that makes unstructured 3D Gaussians compatible with diffusion-based avatar generation, achieving real-time high-resolution rendering while preserving editability and expressive control. A part-aware deformation module provides robust control over full-body pose, facial expressions, and hand gestures, and the structure that the UV features plane shares across subjects enables local editing and attribute transfer. The method uses single-stage diffusion training that jointly performs fitting and denoising, so it learns multi-subject avatars without per-subject optimization; it is validated quantitatively and qualitatively on THuman2.0. The framework targets practical avatar creation for immersive VR/AR, film, and telepresence applications by combining efficient rendering, expressive animation, and versatile editing.

Abstract

This paper aims to introduce 3D Gaussian for efficient, expressive, and editable digital avatar generation. This task faces two major challenges: (1) The unstructured nature of 3D Gaussian makes it incompatible with current generation pipelines; (2) the expressive animation of 3D Gaussian in a generative setting that involves training with multiple subjects remains unexplored. In this paper, we propose a novel avatar generation method named $E^3$Gen, to effectively address these challenges. First, we propose a novel generative UV features plane representation that encodes unstructured 3D Gaussian onto a structured 2D UV space defined by the SMPL-X parametric model. This novel representation not only preserves the representation ability of the original 3D Gaussian but also introduces a shared structure among subjects to enable generative learning of the diffusion model. To tackle the second challenge, we propose a part-aware deformation module to achieve robust and accurate full-body expressive pose control. Extensive experiments demonstrate that our method achieves superior performance in avatar generation and enables expressive full-body pose control and editing. Our project page is https://olivia23333.github.io/E3Gen.
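
The abstract describes encoding unstructured 3D Gaussians onto a structured 2D UV space. One way to picture the lookup step is the minimal sketch below, which bilinearly samples a per-pixel attribute map at the UV coordinate assigned to each Gaussian. The function name `sample_uv_map`, the array layout, and the choice of bilinear interpolation are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np

def sample_uv_map(attr_map, uv):
    """Bilinearly sample an (H, W, C) attribute map at (N, 2) UV points in [0, 1].

    Hypothetical helper: column 0 of `uv` is the horizontal coordinate (u),
    column 1 is the vertical coordinate (v).
    """
    h, w, _ = attr_map.shape
    # Map UV in [0, 1] to continuous pixel coordinates.
    x = np.clip(uv[:, 0], 0.0, 1.0) * (w - 1)
    y = np.clip(uv[:, 1], 0.0, 1.0) * (h - 1)
    # Integer corners of the cell containing each sample point.
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    # Fractional offsets inside the cell, shaped (N, 1) to broadcast over channels.
    fx = (x - x0)[:, None]
    fy = (y - y0)[:, None]
    # Interpolate along x on the two rows, then along y between them.
    top = attr_map[y0, x0] * (1 - fx) + attr_map[y0, x1] * fx
    bottom = attr_map[y1, x0] * (1 - fx) + attr_map[y1, x1] * fx
    return top * (1 - fy) + bottom * fy

# Usage: each Gaussian has a fixed UV coordinate and reads its attributes there.
rng = np.random.default_rng(0)
attr_map = rng.normal(size=(64, 64, 8))      # 8 channels chosen arbitrarily
gaussian_uv = rng.uniform(size=(1000, 2))    # one UV point per Gaussian
gaussian_attrs = sample_uv_map(attr_map, gaussian_uv)  # shape (1000, 8)
```

Because every subject uses the same UV layout, the same `gaussian_uv` array could index attribute maps from different subjects, which is the shared structure the abstract refers to.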

Paper Structure (19 sections, 12 equations, 5 figures, 2 tables)

This paper contains 19 sections, 12 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Method Overview. Our approach utilizes a single-stage diffusion model to train the denoising and fitting processes simultaneously. The UV features plane, $x_i$, is randomly initialized and optimized by both processes. In the denoising process, noise is added to the UV features plane, which is then denoised by a denoising UNet following a v-parameterization scheme. In the fitting process, the UV features plane is decoded into Gaussian attribute maps, which are used to generate a 3D-Gaussian-based avatar in canonical space by fetching the corresponding attributes for the initialized Gaussian primitives. Finally, a part-aware deformation module deforms the avatar into the target pose based on SMPL-X parameters.
  • Figure 2: We demonstrate the effectiveness of our method in achieving precise and robust control over facial expressions and gestures. Our approach enables clear and distinct control over each individual finger, ensuring their visibility and accurate positioning. Additionally, our method exhibits strong robustness when faced with novel poses, producing reasonable and plausible results for facial expressions.
  • Figure 3: Qualitative Comparison. Our method demonstrates superior performance in rendering quality and geometry quality compared to other methods. Due to the challenge of obtaining normals directly from PrimDiffusion, we visualize its mixture primitives as a rough representation of the geometric structure.
  • Figure 4: Ablation on deformation method. Our method achieves more accurate results for a given facial expression compared to the K-nearest neighbors (KNN) based forward skinning method.
  • Figure 5: Our method enables local editing and attribute transfer. In row one, we demonstrate the capability to modify only the nose of the avatar. The shared structure of the UV features plane allows us to transfer attributes between different subjects, as showcased in row two.
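
The attribute transfer described in the Figure 5 caption relies on all subjects sharing one UV layout. A minimal sketch of that idea follows, assuming the transfer can be approximated by copying a masked UV region from one subject's features plane into another's. The function `transfer_region`, the rectangular mask, and the plane shapes are hypothetical and only illustrate the principle; the paper's actual editing procedure may differ.

```python
import numpy as np

def transfer_region(target_plane, source_plane, mask):
    """Return a copy of target_plane with the masked UV region taken from source_plane.

    Hypothetical helper: planes are (H, W, C) arrays in the same UV layout,
    and mask is an (H, W) boolean array selecting the region to transfer.
    """
    if target_plane.shape != source_plane.shape:
        raise ValueError("planes must share the same UV layout and shape")
    if mask.shape != target_plane.shape[:2]:
        raise ValueError("mask must match the spatial size of the planes")
    # Expand the mask to (H, W, 1) so it broadcasts over feature channels.
    m = mask.astype(target_plane.dtype)[..., None]
    # Keep target features outside the mask, source features inside it.
    return target_plane * (1 - m) + source_plane * m

# Usage: move a rectangular UV patch from subject B onto subject A.
rng = np.random.default_rng(0)
plane_a = rng.normal(size=(32, 32, 4))
plane_b = rng.normal(size=(32, 32, 4))
patch = np.zeros((32, 32), dtype=bool)
patch[8:16, 8:16] = True                     # arbitrary region for illustration
edited = transfer_region(plane_a, plane_b, patch)
```

The inputs are left unmodified, so the original and edited planes can be decoded and compared side by side.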