LHM: Large Animatable Human Reconstruction Model from a Single Image in Seconds

Lingteng Qiu, Xiaodong Gu, Peihao Li, Qi Zuo, Weichao Shen, Junfei Zhang, Kejie Qiu, Weihao Yuan, Guanying Chen, Zilong Dong, Liefeng Bo

TL;DR

LHM produces an animatable 3D human avatar from a single image in seconds, using a feed-forward Multimodal Body-Head Transformer that fuses 3D geometric tokens with 2D image features. It outputs a 3D Gaussian Splatting (3DGS) avatar in canonical space; the avatar is initialized from SMPL-X and animated with Linear Blend Skinning, and the model is trained with view-space photometric losses and canonical-space regularization. Key innovations include Head Feature Pyramid Encoding for facial detail, Head Token Shrinkage Regularization to balance attention, and a canonical-space regularization scheme combining as-spherical-as-possible (ASAP) and as-close-as-possible (ACAP) terms. Trained on a large-scale in-the-wild video corpus with synthetic augmentation, LHM achieves state-of-the-art generalization and animation consistency, and the resulting avatars render in real time with no post-processing for the face and hands.
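To make the Linear Blend Skinning step concrete, here is a minimal NumPy sketch of how canonical-space points (for example, Gaussian centers) could be moved to a target pose. It is an illustration under stated assumptions, not the authors' implementation: the function name `linear_blend_skinning`, the two-joint toy skeleton, and the hand-written skinning weights are all hypothetical, and a full 3DGS pipeline would also need to transform each Gaussian's rotation or covariance.

```python
import numpy as np


def linear_blend_skinning(points, weights, transforms):
    """Pose canonical points with Linear Blend Skinning.

    points:     (N, 3) canonical-space positions.
    weights:    (N, J) skinning weights; each row is assumed to sum to 1.
    transforms: (J, 4, 4) per-joint transforms from canonical to posed space.
    Returns an (N, 3) array of posed positions.
    """
    # Blend the joint transforms per point: (N, 4, 4).
    blended = np.einsum("nj,jab->nab", weights, transforms)
    # Lift points to homogeneous coordinates: (N, 4).
    ones = np.ones((points.shape[0], 1))
    homogeneous = np.concatenate([points, ones], axis=1)
    # Apply each point's blended transform and drop the homogeneous coordinate.
    posed = np.einsum("nab,nb->na", blended, homogeneous)
    return posed[:, :3]


# Toy example (hypothetical data): joint 0 stays put, joint 1 moves up by 2 units.
points = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.5, 0.5]])
shift = np.eye(4)
shift[:3, 3] = [0.0, 2.0, 0.0]
transforms = np.stack([np.eye(4), shift])

posed = linear_blend_skinning(points, weights, transforms)
```

In this toy setup the first point follows joint 0 only and stays at the origin, while the second point is weighted equally between the two joints and so moves up by half of joint 1's translation.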

Abstract

Animatable 3D human reconstruction from a single image is a challenging problem due to the ambiguity in decoupling geometry, appearance, and deformation. Recent advances in 3D human reconstruction mainly focus on static human modeling, and their reliance on synthetic 3D scans for training limits their generalization ability. Conversely, optimization-based video methods achieve higher fidelity but demand controlled capture conditions and computationally intensive refinement. Motivated by the emergence of large reconstruction models for efficient static reconstruction, we propose LHM (Large Animatable Human Reconstruction Model) to infer high-fidelity avatars represented as 3D Gaussian splatting in a feed-forward pass. Our model leverages a multimodal transformer architecture to encode human body positional features and image features with an attention mechanism, enabling detailed preservation of clothing geometry and texture. To further improve face identity preservation and fine detail recovery, we propose a head feature pyramid encoding scheme that aggregates multi-scale features of the head region. Extensive experiments demonstrate that LHM generates plausible animatable humans in seconds without post-processing for the face and hands, outperforming existing methods in both reconstruction accuracy and generalization ability.
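The attention-based fusion of body positional features and image features can be illustrated with a small NumPy sketch. It assumes the simplest possible form: single-head self-attention over the concatenation of geometric tokens and image tokens, with a residual connection and randomly initialized projection matrices. The function `fuse_tokens`, the token counts, and the feature dimension are hypothetical choices for illustration; the actual transformer block in the paper is more elaborate and is not reproduced here.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def fuse_tokens(geo_tokens, img_tokens, w_q, w_k, w_v):
    """Joint self-attention over concatenated geometric and image tokens.

    geo_tokens: (Ng, D) tokens derived from 3D body positions.
    img_tokens: (Ni, D) tokens derived from the input image.
    w_q, w_k, w_v: (D, D) projection matrices.
    Returns the updated geometric tokens (Ng, D) and the attention map.
    """
    tokens = np.concatenate([geo_tokens, img_tokens], axis=0)  # (Ng + Ni, D)
    q, k, v = tokens @ w_q, tokens @ w_k, tokens @ w_v
    # Scaled dot-product attention: every token attends to both modalities.
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    fused = tokens + attn @ v  # residual connection
    # Only the geometric tokens would be decoded further (e.g. to Gaussian parameters).
    return fused[: geo_tokens.shape[0]], attn


# Toy sizes (hypothetical): 5 geometric tokens, 12 image tokens, 8 channels.
rng = np.random.default_rng(0)
dim = 8
geo = rng.normal(size=(5, dim))
img = rng.normal(size=(12, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) * 0.1 for _ in range(3))

fused_geo, attn = fuse_tokens(geo, img, w_q, w_k, w_v)
```

Concatenating the two token sets before attention lets each geometric token gather appearance cues from any image token, which is the general idea behind fusing the two modalities in one attention pass.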

Paper Structure

This paper contains 43 sections, 11 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: 3D Avatar Reconstruction and Animation Results of our LHM. Our method reconstructs an animatable human avatar in a single feed-forward pass in seconds. The resulting model supports real-time rendering and pose-controlled animation.
  • Figure 2: Overview of the proposed LHM. Our method extracts body and head image tokens from the input image, and utilizes the proposed Multimodal Body-Head Transformer (MBHT) to fuse the 3D geometric body tokens with the image tokens. After the attention-based fusion process, the geometric body tokens are decoded into Gaussian parameters.
  • Figure 3: Architecture of the proposed Multimodal Body-Head Transformer Block (MBHT-block).
  • Figure 4: Single-view reconstruction comparisons on DeepFashion [Liu et al., CVPR 2016] and in-the-wild images. LHM achieves superior appearance fidelity and texture sharpness, particularly evident in facial details and garment wrinkles.
  • Figure 5: Single-view animatable human reconstruction comparisons on in-the-wild sequences. LHM produces more accurate and photorealistic animation results than the baseline methods. Note that the results of AniGS are not faithful to the input images.
  • ...and 6 more figures