LHM++: An Efficient Large Human Reconstruction Model for Pose-free Images to 3D

Lingteng Qiu, Peihao Li, Heyuan Li, Qi Zuo, Xiaodong Gu, Yuan Dong, Weihao Yuan, Rui Peng, Siyu Zhu, Xiaoguang Han, Guanying Chen, Zilong Dong

Abstract

Reconstructing animatable 3D humans from casually captured images of articulated subjects without camera or pose information is highly practical but remains challenging due to view misalignment, occlusions, and the absence of structural priors. In this work, we present LHM++, an efficient large-scale human reconstruction model that generates high-quality, animatable 3D avatars within seconds from one or multiple pose-free images. At its core is an Encoder-Decoder Point-Image Transformer architecture that progressively encodes and decodes 3D geometric point features to improve efficiency, while fusing hierarchical 3D point features with image features through multimodal attention. The fused features are decoded into 3D Gaussian splats to recover detailed geometry and appearance. To further enhance visual fidelity, we introduce a lightweight 3D-aware neural animation renderer that refines the rendering quality of reconstructed avatars in real time. Extensive experiments show that our method produces high-fidelity, animatable 3D humans without requiring camera or pose annotations. Our code and project page are available at https://lingtengqiu.github.io/LHM++/

Paper Structure

This paper contains 72 sections, 12 equations, 17 figures, 14 tables.

Figures (17)

  • Figure 1: 3D Avatar Reconstruction and Animation Results of our LHM++. Given a set of $N \ge 1$ images of a human subject, without requiring camera parameters or human pose annotations, our method can reconstruct a high-fidelity, animatable 3D human avatar in seconds.
  • Figure 2: Overview of the proposed LHM++. In 2D space, we extract image tokens $\mathbf{T}_{\text{2D}}$ from the input RGB images using DINOv2. In 3D space, geometric tokens $\mathbf{T}_{\text{3D}}$ are derived from SMPL-X anchor points via an MLP. We then design an Encoder-Decoder Point-Image Transformer (PIT) to hierarchically fuse 3D and 2D tokens, where the downsampled 3D tokens interact with 2D tokens via multimodal attention in each layer. The final 3D tokens are decoded to predict 3D Gaussian parameters, followed by a lightweight DPT head for photorealistic animation.
  • Figure 3: Animatable human reconstruction comparisons from sparse images.
  • Figure 4: Human reconstruction of LHM++ in canonical space from sparse image inputs.
  • Figure 5: Qualitative comparison with LHM. LHM++ matches LHM's quality with a single input view and generates progressively more detailed results as the number of views increases. Please zoom in for a better view.
  • ...and 12 more figures