AniGS: Animatable Gaussian Avatar from a Single Image with Inconsistent Gaussian Reconstruction

Lingteng Qiu, Shenhao Zhu, Qi Zuo, Xiaodong Gu, Yuan Dong, Junfei Zhang, Chao Xu, Zhe Li, Weihao Yuan, Liefeng Bo, Guanying Chen, Zilong Dong

TL;DR

This paper adapts a transformer-based video generation model, pretrained on a large-scale video dataset to improve generalization, to generate multi-view canonical pose images and normal maps. It also proposes a robust method for 3D reconstruction from inconsistent images, enabling real-time rendering during inference.

Abstract

Generating animatable human avatars from a single image is essential for various digital human modeling applications. Existing 3D reconstruction methods often struggle to capture fine details in animatable models, while generative approaches for controllable animation, though avoiding explicit 3D modeling, suffer from viewpoint inconsistencies in extreme poses and computational inefficiencies. In this paper, we address these challenges by leveraging the power of generative models to produce detailed multi-view canonical pose images, which help resolve ambiguities in animatable human reconstruction. We then propose a robust method for 3D reconstruction of inconsistent images, enabling real-time rendering during inference. Specifically, we adapt a transformer-based video generation model to generate multi-view canonical pose images and normal maps, pretraining on a large-scale video dataset to improve generalization. To handle view inconsistencies, we recast the reconstruction problem as a 4D task and introduce an efficient 3D modeling approach using 4D Gaussian Splatting. Experiments demonstrate that our method achieves photorealistic, real-time animation of 3D human avatars from in-the-wild images, showcasing its effectiveness and generalization capability.

Paper Structure

This paper contains 50 sections, 5 equations, 22 figures, 3 tables.

Figures (22)

  • Figure 1: 3D Avatar Reconstruction and Animation Results of AniGS. Given a single human image as input, AniGS is capable of reconstructing a high-fidelity 3D avatar in a canonical pose, which can be used for both photorealistic rendering and real-time animation.
  • Figure 2: Overview of the proposed AniGS. In the first stage, a reference image-guided video generation model is employed to produce high-quality multi-view canonical human images along with their corresponding normals, based on the input image. In the second stage, a robust 3D model reconstruction method is applied, using 4D Gaussian Splatting (4DGS) optimization to handle subtle appearance variations across the generated views.
  • Figure 3: Inconsistencies caused by subtle variations in the generated multi-view images, which degrade the 3D reconstruction quality. The red boxes highlight the inconsistent areas.
  • Figure 4: Visual comparison of animation results for the reconstructed 3D avatars. Best viewed with zoom-in.
  • Figure 5: Visual comparison on canonical pose 3D avatar reconstruction from the single-view image. Since SiTH, MagicMan, and LGM cannot reconstruct canonical pose shapes from the input, we take our generated front-view canonical pose image as input to these methods.
  • ...and 17 more figures