
IDOL: Instant Photorealistic 3D Human Creation from a Single Image

Yiyu Zhuang, Jiaxi Lv, Hao Wen, Qing Shuai, Ailing Zeng, Hao Zhu, Shifeng Chen, Yujiu Yang, Xun Cao, Wei Liu

TL;DR

This work tackles the problem of instant, photorealistic 3D human creation from a single image by rethinking data, model, and representation. It introduces HuGe100K, a large-scale, multi-view, photorealistic human dataset, and IDOL, a feed-forward transformer that predicts a 3D Gaussian-based avatar in a SMPL-X UV space for fast, animatable reconstruction. The approach demonstrates state-of-the-art quantitative and qualitative results, with support for texture and shape editing and downstream applications such as video reenactment. The combination of large-scale generated data and a uniform, differentiable 3D representation yields robust generalization to diverse appearances, poses, and viewpoints, enabling practical real-time avatar creation and manipulation.

Abstract

Creating a high-fidelity, animatable 3D full-body avatar from a single image is a challenging task due to the diverse appearance and poses of humans and the limited availability of high-quality training data. To achieve fast and high-quality human reconstruction, this work rethinks the task from the perspectives of dataset, model, and representation. First, we introduce a large-scale HUman-centric GEnerated dataset, HuGe100K, consisting of 100K diverse, photorealistic sets of human images. Each set contains 24-view frames in specific human poses, generated using a pose-controllable image-to-multi-view model. Next, leveraging the diversity in views, poses, and appearances within HuGe100K, we develop a scalable feed-forward transformer model to predict a 3D human Gaussian representation in a uniform space from a given human image. This model is trained to disentangle human pose, body shape, clothing geometry, and texture. The estimated Gaussians can be animated without post-processing. We conduct comprehensive experiments to validate the effectiveness of the proposed dataset and method. Our model demonstrates the ability to efficiently reconstruct photorealistic humans at 1K resolution from a single input image using a single GPU instantly. Additionally, it seamlessly supports various applications, as well as shape and texture editing tasks. Project page: https://yiyuzhuang.github.io/IDOL/.

Paper Structure

This paper contains 48 sections, 1 equation, 16 figures, 5 tables.

Figures (16)

  • Figure 1: This work introduces (a) IDOL, a feed-forward, single-image human reconstruction framework that is fast, high-fidelity, and generalizable; (b) by training on the proposed Large Generated Human Multi-View Dataset of 100K multi-view subjects, our method generalizes well to diverse human shapes, cross-domain data, extreme viewpoints, and occlusions; (c) with a uniform structured representation, the avatars are directly animatable and easily editable.
  • Figure 2: Pipeline for constructing our HuGe100K. Diverse attribute combinations from GPT-4 templates create text prompts, generating synthetic images via FLUX, combined with real images from DeepFashion. SMPL-X fitting produces multi-view pose sequences with 360-degree rotations and diverse animatable motions. MVChamp then converts these sequences into multi-view images, ensuring 3D consistency in the dataset.
  • Figure 3: A paired example from the proposed HuGe100K dataset. For each reference image, we generate a set of multi-view images using an estimated shape and a specific pose. The figure shows that the generated views are well aligned with the target pose.
  • Figure 4: The architecture of IDOL, a fully differentiable transformer-based framework for reconstructing an animatable 3D human from a single image. The model integrates a high-resolution ($1024\times1024$) encoder [khirodkar2025sapiens] and fuses image tokens with learnable UV tokens through the UV-Alignment Transformer. A UV Decoder predicts Gaussian attribute maps as intermediate representations, capturing the human's geometry and appearance in a structured 2D UV space defined by the SMPL-X model. These maps, in conjunction with the SMPL-X model, represent a 3D human avatar in a canonical space, which can be animated using linear blend skinning (LBS). The model is optimized using multi-view images with diverse poses and identities, learning to disentangle pose, appearance, and shape.
  • Figure 5: Qualitative results of our MVChamp ablation study (left) and comparison experiment (right).
  • ...and 11 more figures
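
The Figure 4 caption states that the canonical-space avatar is animated with linear blend skinning (LBS). The following is a minimal sketch of that idea, not the authors' implementation: each Gaussian center is moved by a weighted sum of per-joint rigid transforms. All names (`rotation_z`, `blend_transforms`, `linear_blend_skinning`) and the two-joint toy setup are assumptions made for illustration. A full pipeline would take skinning weights and joint transforms from SMPL-X and would also transform each Gaussian's rotation, which this sketch omits.

```python
import math

# 4x4 identity: a joint that does not move relative to the rest pose.
IDENTITY = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]


def rotation_z(angle_rad):
    """Return a 4x4 homogeneous rotation about the z-axis (toy joint transform)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [
        [c, -s, 0.0, 0.0],
        [s, c, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
        [0.0, 0.0, 0.0, 1.0],
    ]


def blend_transforms(weights, joint_transforms):
    """Weighted sum of 4x4 joint transforms for a single point."""
    blended = [[0.0] * 4 for _ in range(4)]
    for w, transform in zip(weights, joint_transforms):
        for i in range(4):
            for j in range(4):
                blended[i][j] += w * transform[i][j]
    return blended


def apply_transform(transform, point):
    """Apply a 4x4 homogeneous transform to a 3D point."""
    h = (point[0], point[1], point[2], 1.0)
    return tuple(sum(transform[i][j] * h[j] for j in range(4)) for i in range(3))


def linear_blend_skinning(centers, skin_weights, joint_transforms):
    """Pose canonical Gaussian centers.

    centers:          list of (x, y, z) points in canonical space
    skin_weights:     one weight tuple per point, summing to 1 over joints
    joint_transforms: one 4x4 matrix per joint (rest pose -> target pose)
    """
    return [
        apply_transform(blend_transforms(w, joint_transforms), p)
        for p, w in zip(centers, skin_weights)
    ]


if __name__ == "__main__":
    joints = [IDENTITY, rotation_z(math.pi / 2)]  # joint 1 rotates by 90 degrees
    centers = [(1.0, 0.0, 0.0)] * 3
    weights = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
    for posed in linear_blend_skinning(centers, weights, joints):
        print(tuple(round(v, 6) for v in posed))
```

A point bound entirely to the static joint stays where it is, a point bound entirely to the rotating joint follows it, and a point with mixed weights lands in between. Because the weights and transforms are fixed functions of the body model, the same predicted Gaussians can be re-posed without any per-subject post-processing, which is consistent with the claim in the abstract.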