Human-VDM: Learning Single-Image 3D Human Gaussian Splatting from Video Diffusion Models

Zhibin Liu, Haoye Dong, Aviral Chharia, Hefeng Wu

TL;DR

Human-VDM generates lifelike 3D humans from a single RGB image in three stages: a view-consistent video diffusion model produces a coherent video of the subject, super-resolution and frame interpolation enhance its texture and geometric smoothness, and 3D Gaussian Splatting performs the final reconstruction. The method combines a fine-tuned latent video diffusion model (SV3D) with a dedicated video augmentation pipeline (CodeFormer and PerVFI) and a SMPL-based Gaussian Splatting representation with an optimizable appearance tensor. The authors report state-of-the-art performance on single-view 3D human generation, supported by qualitative and quantitative comparisons, user studies, and ablations of each component. They also note two limitations, fine-grained finger geometry and real-time efficiency, and suggest future work on more efficient models and refined finger geometry.

Abstract

Generating lifelike 3D humans from a single RGB image remains a challenging task in computer vision, as it requires accurate modeling of geometry, high-quality texture, and plausible unseen parts. Existing methods typically use multi-view diffusion models for 3D generation, but they often suffer from inconsistent views, which hinders high-quality 3D human generation. To address this, we propose Human-VDM, a novel method for generating 3D humans from a single RGB image using Video Diffusion Models. Human-VDM provides temporally consistent views for 3D human generation using Gaussian Splatting. It consists of three modules: a view-consistent human video diffusion module, a video augmentation module, and a Gaussian Splatting module. First, a single image is fed into the human video diffusion module to generate a coherent human video. Next, the video augmentation module applies super-resolution and video interpolation to enhance the textures and geometric smoothness of the generated video. Finally, the 3D Human Gaussian Splatting module learns lifelike humans under the guidance of these high-resolution and view-consistent images. Experiments demonstrate that Human-VDM generates high-quality 3D humans from a single image, outperforming state-of-the-art methods both qualitatively and quantitatively. Project page: https://human-vdm.github.io/Human-VDM/
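
The data flow of the first two modules can be illustrated with a minimal Python sketch. It is not the authors' implementation: it tracks only frame metadata (camera azimuth and resolution), not pixels. The names `Frame`, `generate_orbit_video`, `super_resolve`, `interpolate`, and `build_training_views` are hypothetical, as are the frame count and scale factor. It also assumes that the generated video orbits the subject at evenly spaced azimuths and that super-resolution runs before interpolation; the abstract does not state either detail.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    azimuth_deg: float         # assumed camera azimuth around the subject
    height: int
    width: int
    synthesized: bool = False  # True if produced by frame interpolation


def generate_orbit_video(image_hw: Tuple[int, int], num_frames: int) -> List[Frame]:
    """Stand-in for the video diffusion module: one frame per evenly spaced azimuth."""
    h, w = image_hw
    step = 360.0 / num_frames
    return [Frame(i * step, h, w) for i in range(num_frames)]


def super_resolve(frames: List[Frame], scale: int) -> List[Frame]:
    """Stand-in for super-resolution: only the recorded resolution changes."""
    return [Frame(f.azimuth_deg, f.height * scale, f.width * scale, f.synthesized)
            for f in frames]


def interpolate(frames: List[Frame]) -> List[Frame]:
    """Stand-in for frame interpolation: add one in-between view after each frame."""
    out: List[Frame] = []
    n = len(frames)
    for i, f in enumerate(frames):
        nxt_az = frames[(i + 1) % n].azimuth_deg
        if i == n - 1:
            nxt_az += 360.0  # the last frame wraps around to the first view
        mid = ((f.azimuth_deg + nxt_az) / 2.0) % 360.0
        out.append(f)
        out.append(Frame(mid, f.height, f.width, synthesized=True))
    return out


def build_training_views(image_hw: Tuple[int, int],
                         num_frames: int = 8, scale: int = 2) -> List[Frame]:
    """Chain the stand-ins: video generation, then augmentation."""
    video = generate_orbit_video(image_hw, num_frames)
    video = super_resolve(video, scale)
    return interpolate(video)


if __name__ == "__main__":
    for view in build_training_views((64, 48), num_frames=4, scale=2):
        print(view)
```

In this sketch the returned list plays the role of the high-resolution, view-consistent images that guide the Gaussian Splatting module, with interpolated frames filling the gaps between generated viewpoints.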

Paper Structure (21 sections, 15 equations, 8 figures, 2 tables)

This paper contains 21 sections, 15 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Human-VDM for generating 3D humans from a single image. Given a single RGB human image, Human-VDM aims to generate a high-fidelity 3D human. Human-VDM preserves face identity, delivers realistic texture, ensures accurate geometry, and maintains a valid pose of the generated 3D human, surpassing the current state-of-the-art models.
  • Figure 2: Human-VDM model architecture. An image $I$ is first input to a view-consistent human video diffusion module to generate a coherent human video. Next, the video augmentation module applies super-resolution and frame interpolation to enhance texture and generate high-quality interpolated frames. Finally, the 3D Human Gaussian Splatting module learns lifelike 3D humans.
  • Figure 3: Qualitative Results. Novel-view results from Human-VDM across various poses, genders, clothing, and hairstyles demonstrate the robustness of the proposed model. It consistently achieves photorealistic quality and precise geometry. Zoom in for details.
  • Figure 4: Qualitative Comparison. Human-VDM compared to other SOTA models, including PIFu [saito2019pifu], PaMIR [zheng2021pamir], TeCH [huang2024tech], Ultraman [chen2024ultraman], SiTH [ho2024sith], and SIFU [zhang2024sifu]. The results demonstrate that Human-VDM achieves superior 3D human generation quality. Note that recent SOTA methods fail to predict the unseen back view, as shown above. Zoom in for details.
  • Figure 5: Qualitative Visual Ablation Comparisons. Compared to other variants, the proposed full model achieves highly realistic textures and accurate geometry.
  • ...and 3 more figures