SIGMAN: Scaling 3D Human Gaussian Generation with Millions of Assets

Yuhang Yang, Fengqi Liu, Yixing Lu, Qin Zhao, Pingyu Wu, Wei Zhai, Ran Yi, Yang Cao, Lizhuang Ma, Zheng-Jun Zha, Junting Dong

TL;DR

This work tackles the challenge of 3D human digitization under scarce assets and ill-posed low-to-high-dimensional mappings by introducing a latent-space generation framework. It combines a UV-structured VAE to compress multi-view data into Gaussian latents with an MM-DiT-based conditional generator to produce 3D Gaussians, reframing the problem as a conditional-to-latent distribution transfer and enabling end-to-end inference. A large-scale HGS-1M dataset of one million 3D human Gaussians is constructed from multi-view optimizations and synthetic data, enabling scalable training and robust rendering of textured, pose-dependent humans. The results show high-fidelity Gaussians with fine facial details and loose clothing deformation, highlighting the practicality of large-scale latent generation for 3D human digitization and its potential impact on AR/VR, gaming, and animation.

Abstract

3D human digitization has long been a highly pursued yet challenging task. Existing methods aim to generate high-quality 3D digital humans from single or multiple views, but remain primarily constrained by current paradigms and the scarcity of 3D human assets. Specifically, recent approaches fall into several paradigms: optimization-based and feed-forward (both single-view regression and multi-view generation with reconstruction). However, they are limited by slow speed, low quality, cascade reasoning, and ambiguity in mapping low-dimensional planes to high-dimensional space due to occlusion and invisibility, respectively. Furthermore, existing 3D human assets remain small-scale and insufficient for large-scale training. To address these challenges, we propose a latent space generation paradigm for 3D human digitization, which compresses multi-view images into Gaussians via a UV-structured VAE and performs DiT-based conditional generation. This transforms the ill-posed low-to-high-dimensional mapping problem into a learnable distribution shift and also supports end-to-end inference. In addition, we employ a multi-view optimization approach combined with synthetic data to construct the HGS-1M dataset, which contains 1 million 3D Gaussian assets to support large-scale training. Experimental results demonstrate that our paradigm, powered by large-scale training, produces high-quality 3D human Gaussians with intricate textures, facial details, and loose clothing deformation.

Paper Structure

This paper contains 19 sections, 5 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: (a) In this work, we construct a large-scale, unified 3D Human Gaussian Dataset, called HGS-1M, to support (b) the large-scale generation model for 3D human Gaussian generation. (c) This paradigm, with large-scale data, produces high-quality 3D human Gaussians that exhibit complex textures, facial details, and realistic deformation of loose clothing.
  • Figure 2: HGS-1M Dataset. The constructed HGS-1M dataset contains 1 million 3D Gaussian human assets of different ages, races, appearances, and poses. It supports free-view rendering.
  • Figure 3: Method. The pipeline of our method. (a) The UV-structured VAE uses human priors to define learnable tokens in UV space, which query multi-view contexts to model the Gaussian latent. The latent is then decoded into human Gaussians in canonical space, which can be driven by differentiable LBS to obtain the final posed human Gaussians. (b) The MM-DiT architecture treats the conditional sequence and the noise as a single sequence to perform controllable 3D human Gaussian generation.
  • Figure 4: Qualitative comparison with baselines on inputs covering different ages, races, viewpoints, complex textures, and loose clothing. Our results show more detail in faces and clothes. Note: We keep the original best inference setting of each compared method: SIFU uses one view, LGM uses four views, and GHG uses three views.
  • Figure 5: Loose clothing, e.g., skirts, demonstrates natural deformation under novel poses.
  • ...and 5 more figures
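
The Figure 3 caption mentions that canonical-space Gaussians are driven by differentiable LBS (linear blend skinning). The following is a minimal NumPy sketch of standard LBS applied to Gaussian centers only. It is not the authors' implementation: the function name `linear_blend_skinning`, the array shapes, and the toy two-joint setup are all assumptions made for illustration, and posing Gaussian orientations or covariances is not shown.

```python
import numpy as np


def linear_blend_skinning(points, weights, transforms):
    """Pose canonical points with standard linear blend skinning.

    points:     (N, 3) canonical positions (here: hypothetical Gaussian centers)
    weights:    (N, J) skinning weights, each row assumed to sum to 1
    transforms: (J, 4, 4) canonical-to-posed rigid transform per joint
    """
    # Blend the per-joint transforms for every point -> (N, 4, 4).
    blended = np.einsum("nj,jab->nab", weights, transforms)
    # Lift points to homogeneous coordinates -> (N, 4).
    ones = np.ones((points.shape[0], 1))
    homogeneous = np.concatenate([points, ones], axis=1)
    # Apply each point's blended transform and drop the homogeneous coordinate.
    posed = np.einsum("nab,nb->na", blended, homogeneous)
    return posed[:, :3]


# Toy example (assumed values): joint 0 stays fixed, joint 1 moves up by 2 along y.
identity = np.eye(4)
shift_up = np.eye(4)
shift_up[1, 3] = 2.0
transforms = np.stack([identity, shift_up])

points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0]])
weights = np.array([[1.0, 0.0],    # fully bound to joint 0
                    [0.5, 0.5]])   # shared equally between both joints

posed = linear_blend_skinning(points, weights, transforms)
```

In this toy setup the first point stays at the origin and the second moves halfway along joint 1's translation. Because every step is a linear operation on the inputs, the same computation written in an autodiff framework would be differentiable with respect to points, weights, and transforms, which is the property the caption relies on.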