HumanOrbit: 3D Human Reconstruction as 360° Orbit Generation

Keito Suzuki, Kunyao Chen, Lei Wang, Bang Du, Runfa Blark Li, Peng Liu, Ning Bi, Truong Nguyen

TL;DR

Experimental results validate the effectiveness of HumanOrbit for multi-view image generation and show that the reconstructed 3D models are more complete and more faithful than those from state-of-the-art baselines.

Abstract

We present a method for generating a full 360° orbit video around a person from a single input image. Existing methods typically adapt image-based diffusion models for multi-view synthesis, but their results are inconsistent across views and with the original identity. In contrast, recent video diffusion models have demonstrated the ability to generate photorealistic results that align well with the given prompts. Inspired by these results, we propose HumanOrbit, a video diffusion model for multi-view human image generation. Our approach enables the model to synthesize continuous camera rotations around the subject, producing geometrically consistent novel views while preserving the appearance and identity of the person. Using the generated multi-view frames, we further propose a reconstruction pipeline that recovers a textured mesh of the subject. Experimental results validate the effectiveness of HumanOrbit for multi-view image generation and show that the reconstructed 3D models are more complete and more faithful than those from state-of-the-art baselines.
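
To make the notion of a 360° orbit concrete, the following is a minimal geometric sketch of evenly spaced camera positions on a circle around a subject. It is an illustration only, not the paper's method: HumanOrbit generates the orbit frames with a video diffusion model rather than from explicit poses. The function names, the radius, the y-up axis convention, and the choice of placing the subject at the origin are all assumptions made for this example.

```python
import math


def orbit_camera_positions(num_views, radius=2.0, height=0.0):
    """Return a list of (azimuth_deg, (x, y, z)) for cameras on a circle.

    Assumptions (not from the paper): the subject is at the origin,
    the y-axis points up, and azimuth 0 places the camera on the +z axis.
    """
    if num_views <= 0:
        raise ValueError("num_views must be positive")
    if radius <= 0:
        raise ValueError("radius must be positive")

    poses = []
    for i in range(num_views):
        # Evenly spaced azimuths over a full turn; 360 itself is excluded
        # because it would coincide with the first view.
        azimuth = 360.0 * i / num_views
        theta = math.radians(azimuth)
        x = radius * math.sin(theta)
        z = radius * math.cos(theta)
        poses.append((azimuth, (x, height, z)))
    return poses


def look_at_direction(position):
    """Unit vector pointing from a camera position toward the origin."""
    x, y, z = position
    norm = math.sqrt(x * x + y * y + z * z)
    return (-x / norm, -y / norm, -z / norm)
```

Calling `orbit_camera_positions(4)` yields front, side, back, and side viewpoints at 90° intervals, and `look_at_direction` gives the viewing direction that keeps the subject centered in each of them.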

Paper Structure (13 sections, 5 equations, 10 figures, 1 table)


Figures (10)

  • Figure 1: Given an in-the-wild image, we treat the ill-posed problem of multi-view synthesis as orbit video generation for robust 3D human reconstruction. Our model creates identity-preserving and view-consistent frames, resulting in a high-quality textured mesh.
  • Figure 2: The proposed HumanOrbit model. We finetune a DiT-based video diffusion model such that it directly generates a 360° orbit video given a single input image. While keeping much of the architecture frozen, the finetuned LoRAs learn to accurately conduct an orbit around the subject, generating consistent multi-view images.
  • Figure 3: The proposed mesh reconstruction framework. Given the generated multi-view images, we first apply an SfM method to obtain the point cloud and the camera parameters for each view. We then estimate a normal map for each frame. Finally, using these intermediate results, the textured mesh is reconstructed via an explicit mesh carving method.
  • Figure 4: Two examples of the camera pose and point cloud predicted by VGGT on our generated multi-view images. The original images are shown in Figure 1.
  • Figure 5: Visual comparison of multi-view generation results on full-body images from the CCP dataset. Given the input image, we show three generated views (left, back, and right) from each method. Zoom in for more detail.
  • ...and 5 more figures