3D$^2$-Actor: Learning Pose-Conditioned 3D-Aware Denoiser for Realistic Gaussian Avatar Modeling

Zichen Tang, Hongyu Yang, Hanchen Zhang, Jiaxin Chen, Di Huang

TL;DR

This work tackles the challenge of reconstructing animatable 3D human avatars from sparse multi-view RGB data, addressing pose generalization and temporal consistency. It introduces 3D$^2$-Actor, a 3D-aware denoiser that interleaves pose-conditioned 2D denoising with a 3D Gaussian-based rectifier, and uses a two-stage projection with a mesh-guided local coordinate representation. A Gaussian consistency sampling strategy is proposed to ensure smooth inter-frame transitions without additional temporal modules. Experiments on ZJU-MoCap demonstrate high-fidelity avatar modeling and robust novel-pose generalization, with effective continuous video synthesis and ablations validating the design choices.

Abstract

Advancements in neural implicit representations and differentiable rendering have markedly improved the ability to learn animatable 3D avatars from sparse multi-view RGB videos. However, current methods that map observation space to canonical space often face challenges in capturing pose-dependent details and generalizing to novel poses. While diffusion models have demonstrated remarkable zero-shot capabilities in 2D image generation, their potential for creating animatable 3D avatars from 2D inputs remains underexplored. In this work, we introduce 3D$^2$-Actor, a novel approach featuring a pose-conditioned 3D-aware human modeling pipeline that integrates iterative 2D denoising and 3D rectifying steps. The 2D denoiser, guided by pose cues, generates detailed multi-view images that provide the rich feature set necessary for high-fidelity 3D reconstruction and pose rendering. Complementing this, our Gaussian-based 3D rectifier renders images with enhanced 3D consistency through a two-stage projection strategy and a novel local coordinate representation. Additionally, we propose an innovative sampling strategy to ensure smooth temporal continuity across frames in video synthesis. Our method effectively addresses the limitations of traditional numerical solutions in handling ill-posed mappings, producing realistic and animatable 3D human avatars. Experimental results demonstrate that 3D$^2$-Actor excels in high-fidelity avatar modeling and robustly generalizes to novel poses. Code is available at: https://github.com/silence-tang/GaussianActor.
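The interleaving of 2D denoising and 3D rectifying steps described above can be illustrated with a toy control-flow sketch. This is not the authors' implementation: `denoise_2d`, `rectify_3d`, `add_noise`, `interleaved_sampling`, the `rectify_every` parameter, and the noise schedule `sigmas` are all hypothetical names introduced here, and the two model calls are replaced by trivial NumPy stand-ins so that the loop structure is runnable.

```python
import numpy as np

def denoise_2d(noisy_views, pose_cues):
    # Hypothetical stand-in for the pose-conditioned 2D denoiser:
    # blends each noisy view with its pose cue to give a "clean" estimate.
    return 0.5 * noisy_views + 0.5 * pose_cues

def rectify_3d(clean_views):
    # Hypothetical stand-in for the Gaussian-based 3D rectifier:
    # replaces every view with the cross-view mean, a crude proxy for
    # rendering all views from one shared 3D representation.
    shared = clean_views.mean(axis=0, keepdims=True)
    return np.repeat(shared, clean_views.shape[0], axis=0)

def add_noise(views, sigma, rng):
    # Re-noise the clean estimate to the next (lower) noise level.
    return views + sigma * rng.standard_normal(views.shape)

def interleaved_sampling(pose_cues, sigmas, rectify_every=2, seed=0):
    # pose_cues: array of shape (num_views, H, W, C).
    # sigmas: decreasing noise levels (assumed schedule, not from the paper).
    rng = np.random.default_rng(seed)
    views = sigmas[0] * rng.standard_normal(pose_cues.shape)  # pure noise
    for step, sigma_next in enumerate(list(sigmas[1:]) + [0.0]):
        clean = denoise_2d(views, pose_cues)      # 2D step
        if step % rectify_every == 0:
            clean = rectify_3d(clean)             # 3D step
        views = add_noise(clean, sigma_next, rng)
    return views

pose = np.ones((4, 8, 8, 3))
result = interleaved_sampling(pose, [1.0, 0.5, 0.25], rectify_every=1)
print(result.shape)
```

In this sketch, how often the 3D step runs is a free parameter; the actual schedule of 2D and 3D steps is a design decision of the method and is not specified in the abstract.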

Paper Structure

This paper contains 18 sections, 10 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Illustration of the 3D-aware denoising process. (a) The 3D-aware denoising pipeline consists of interlaced 2D and 3D steps. It begins with pure noise input, progressively generating realistic multi-view images of the human avatar with the assistance of pose information. (b) Guided by body segmentation maps as pose cues, the 2D denoiser (blue box in (a)) transforms noised images from the previous 2D or 3D steps into clean ones with enhanced intricate details. It also provides clean images for the subsequent 3D rectifier to achieve accurate 3D human avatar modeling. (c) Given clean images from $N$ anchor views, the 3D rectifier (green box in (a)) performs a two-stage projection leveraging a mesh-based Gaussian local coordinate representation to reconstruct 3D Gaussians, enabling the rendering of multi-view human images with high 3D consistency.
  • Figure 2: An illustration of the inter-frame Gaussian consistency sampling strategy for improving temporal continuity.
  • Figure 3: Qualitative comparison of single-frame novel pose synthesis results against ARAH [wang2022arah] and PoseVocab [li2023posevocab] on sequences 313 and 315 of the ZJU-MoCap dataset. Please zoom in for better observation.
  • Figure 4: Consecutive frame generation results. Top row shows results using the proposed sampling strategy; bottom row displays results from independent sampling.
  • Figure 5: Novel pose synthesis results with different designs of our 3D-aware denoiser.
  • ...and 1 more figure