InfiniHuman: Infinite 3D Human Creation with Precise Control

Yuxuan Xue, Xianghui Xie, Margaret Kostyrko, Gerard Pons-Moll

TL;DR

InfiniHuman addresses the scarcity and cost of richly annotated 3D human data by distilling foundation models into two components: InfiniHumanData, a large-scale multi-modal dataset of 111K richly annotated identities, and InfiniHumanGen, a dual-model system (Gen-Schnell for fast Gaussian-splat avatars and Gen-HRes for high-resolution textured meshes) conditioned on text, SMPL body shape, and clothing images. The approach uses orthographic multi-view diffusion to generate consistent, view-aligned data and a joint conditional distribution to unify modalities, enabling precise control over appearance, pose, and clothing. Experiments show state-of-the-art visual quality, faster high-resolution generation, and robust attribute controllability; in a user study, the generated identities were indistinguishable from renderings of real scans. Applications include try-on, re-animation, and physical fabrication. The authors plan to release the data generation pipeline, the dataset, and the models to support research and deployment in fashion, gaming, and AR/VR.

Abstract

Generating realistic and controllable 3D human avatars is a long-standing challenge, particularly when covering broad attribute ranges such as ethnicity, age, clothing styles, and detailed body shapes. Capturing and annotating large-scale human datasets for training generative models is prohibitively expensive and limited in scale and diversity. The central question we address in this paper is: Can existing foundation models be distilled to generate theoretically unbounded, richly annotated 3D human data? We introduce InfiniHuman, a framework that synergistically distills these models to produce richly annotated human data at minimal cost and with theoretically unlimited scalability. We propose InfiniHumanData, a fully automatic pipeline that leverages vision-language and image generation models to create a large-scale multi-modal dataset. A user study shows that our automatically generated identities are indistinguishable from scan renderings. InfiniHumanData contains 111K identities spanning unprecedented diversity. Each identity is annotated with multi-granularity text descriptions, multi-view RGB images, detailed clothing images, and SMPL body-shape parameters. Building on this dataset, we propose InfiniHumanGen, a diffusion-based generative pipeline conditioned on text, body shape, and clothing assets. InfiniHumanGen enables fast, realistic, and precisely controllable avatar generation. Extensive experiments demonstrate significant improvements over state-of-the-art methods in visual quality, generation speed, and controllability. Our approach enables high-quality avatar generation with fine-grained control at effectively unbounded scale through a practical and affordable solution. We will publicly release the automatic data generation pipeline, the comprehensive InfiniHumanData dataset, and the InfiniHumanGen models at https://yuxuan-xue.com/infini-human.

Paper Structure

This paper contains 38 sections, 2 equations, 24 figures, 2 tables.

Figures (24)

  • Figure 1: Examples from InfiniHumanData. a) Diverse human identities covering a wide range of ethnicities, age groups (including children), clothing styles, hair types, and skin tones, which are visually indistinguishable from renderings of real scans (see the evaluation section). b) Multi-modal annotations for each subject, including I) multi-view RGB images (full-body and head), II) SMPL parameters, III) clothing asset images, and IV) multi-granularity text descriptions.
  • Figure 2: Overview of the data generation framework in InfiniHumanData. The process is fully automated by leveraging foundation models. Desired outputs are marked with flags: A) Structured text descriptions, C) Clothing style images, E) Body shape in SMPL format plus face and hand keypoints, F) Orthographic multi-view images with controlled lighting conditions suitable for 3D lifting.
  • Figure 5: Generating avatars wearing a given garment from the fashion industry. The identity is preserved while the try-on garment changes.
  • Figure 6: Generating avatars with precise pose and shape control and text-based editing. The identity is preserved during shape and text-based editing.
  • Figure 7: Qualitative comparison to SOTA text-to-3D avatar generators. We compare with SDS-based avatar generation methods and the mesh-based avatar generation method Chupa [kim2023chupa]. Our generator follows the text closely and also achieves high generation quality.
  • ...and 19 more figures