Ultraman: Single Image 3D Human Reconstruction with Ultra Speed and Detail

Mingjin Chen, Junhao Chen, Xiaojun Ye, Huan-ang Gao, Xiaoxue Chen, Zhaoxin Fan, Hao Zhao

TL;DR

Ultraman tackles the ill-posed problem of reconstructing textured 3D humans from a single image by combining depth-informed mesh reconstruction with a diffusion-based, multi-view texture generation pipeline guided by prompts and viewpoints. A GPT4V-driven VQA module produces detailed prompts, a view strategy samples ten viewpoints, and a texture-mapping stage projects the synthesized views onto a UV mesh using seam smoothing and region-aware generation masks. The approach uses IP-Adapter and ControlNet to achieve view- and depth-consistent texture generation, followed by a seam-aware texturing process that preserves frontal details while enriching the back and side textures. Empirical results show that Ultraman reduces inference time to about 20–30 minutes and outperforms state-of-the-art methods in both geometry and appearance on standard datasets, with consistently strong user preference, enabling fast production of high-fidelity digital humans for AR/VR and related applications.

Abstract

3D human body reconstruction has long been a challenge in computer vision. Previous methods are often time-consuming and struggle to capture the detailed appearance of the human body. In this paper, we propose a new method called Ultraman for fast reconstruction of textured 3D human models from a single image. Compared to existing techniques, Ultraman greatly improves reconstruction speed and accuracy while preserving high-quality texture details. We present a new framework for human reconstruction consisting of three parts: geometric reconstruction, texture generation, and texture mapping. First, a mesh reconstruction framework accurately extracts the 3D human shape from a single image. In parallel, we propose a method to generate multi-view consistent images of the human body from a single image. Finally, these are combined with a novel texture mapping method that optimizes texture details and ensures color consistency during reconstruction. Through extensive experiments and evaluations, we demonstrate the superior performance of Ultraman on various standard datasets. In addition, Ultraman outperforms state-of-the-art methods in terms of human rendering quality and speed. Upon acceptance of the article, we will make the code and data publicly available.

Paper Structure

This paper contains 26 sections, 3 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Given a single RGB image of a human, Ultraman reconstructs the surface geometry and appearance of the clothed human body in 30 minutes, whereas the state-of-the-art method TeCH [huang2023tech] takes 5 hours on average. Its back view is also highly consistent with the front view, while TeCH struggles to generate a meaningful back-view appearance.
  • Figure 2: Overview of the Ultraman framework. Ultraman takes an image $I_0$ of a human as input. The blue rounded dashed rectangle denotes the Prompt Generation Module, in which the responses to the questions are generated by GPT4v [openai2023gpt4v]. The red rounded dashed rectangles denote the mesh reconstruction modules, which generate the mesh and the UV map. The yellow rounded dashed rectangle is the multi-view texture generation module. It includes a control model $\mathcal{M}_c$, which controls texture generation at the current viewpoint using the prompt for that viewpoint, the depth map rendered from the mesh, and the input image. The Texturing module pastes the corresponding texture back onto the mesh according to the generation mask.
  • Figure 3: Multi-view generation is controlled with IP-Adapter and a depth ControlNet unit. A single frontal portrait serves as the reference image, and a VQA-generated prompt serves as the textual cue for SD-XL, which together yield consistent multi-view images of the person. The depth map is rendered at the specified viewpoint from the mesh exported by the Mesh Reconstruction step.
  • Figure 4: We categorize the generation mask in the current view into four region types. The "always keep" area is the front texture of the human body from the initial view; it is never updated and never regenerated. The "keep" area is texture that has already been best observed in other views and does not need to be updated. The "update" area is where we lightly update the existing texture. The "new" area is where we need to generate new texture.
  • Figure 5: Seam smoothing process and effect. We show the texture before and after seam smoothing. We find the Canny maps from the mask of the old regions that already exist, $C_{old}$, and of the new regions generated in the current viewpoint, $C_{new}$. The two Canny maps are intersected and extended appropriately. We use the open-source toolkit clipdrop for the smoothing step. $C_{new}$ differs from viewpoint to viewpoint. When there are no pixels in the current view that need to be updated, we find the Canny map of the "new" area mask. When there are no pixels in the current view that need to be updated, we find the Canny map of the "update" area mask.
  • ...and 3 more figures
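
The four generation-mask regions of Figure 4 and the seam region of Figure 5 can be illustrated with a small pure-Python sketch. It is not the authors' implementation. The view-angle cosine test, the `margin` threshold, the input names (`front`, `painted`, `best_cos`, `cur_cos`), and the dilation-based stand-in for intersecting Canny maps are all assumptions made for illustration. Which region counts as freshly generated (`fresh`) is left to the caller.

```python
from enum import IntEnum


class Region(IntEnum):
    ALWAYS_KEEP = 0  # front texture from the input view; never regenerated
    KEEP = 1         # already best observed from an earlier view
    UPDATE = 2       # textured before, but refined lightly in this view
    NEW = 3          # not textured yet; generated from scratch


def classify(front, painted, best_cos, cur_cos, margin=0.05):
    """Label every pixel of the current view with one of the four regions.

    Hypothetical inputs (2D lists of equal shape):
      front    -- True where the pixel shows texture from the input image
      painted  -- True where an earlier view already textured the surface
      best_cos -- best view-angle cosine recorded so far for that point
      cur_cos  -- view-angle cosine in the current view
    The cosine comparison and `margin` are assumptions of this sketch.
    """
    labels = []
    for y, row in enumerate(front):
        out = []
        for x, is_front in enumerate(row):
            if is_front:
                out.append(Region.ALWAYS_KEEP)
            elif not painted[y][x]:
                out.append(Region.NEW)
            elif cur_cos[y][x] > best_cos[y][x] + margin:
                out.append(Region.UPDATE)
            else:
                out.append(Region.KEEP)
        labels.append(out)
    return labels


def dilate(mask, steps=1):
    """Grow a binary mask by `steps` pixels using 4-connectivity."""
    h, w = len(mask), len(mask[0])
    for _ in range(steps):
        grown = [row[:] for row in mask]
        for y in range(h):
            for x in range(w):
                if mask[y][x]:
                    for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                        ny, nx = y + dy, x + dx
                        if 0 <= ny < h and 0 <= nx < w:
                            grown[ny][nx] = True
        mask = grown
    return mask


def seam_band(labels, fresh=Region.NEW, width=1):
    """Mark pixels near the border between old and freshly generated texture."""
    old = [[r in (Region.ALWAYS_KEEP, Region.KEEP) for r in row] for row in labels]
    new = [[r == fresh for r in row] for row in labels]
    # Stand-in for intersecting two Canny maps: a pixel lies on the seam when
    # both the old and the fresh region reach it after growing by one pixel.
    grown_old, grown_new = dilate(old), dilate(new)
    seam = [[a and b for a, b in zip(ro, rn)] for ro, rn in zip(grown_old, grown_new)]
    # "Extend appropriately": widen the seam by the remaining pixels.
    return dilate(seam, width - 1)


# Toy 1x4 strip: front pixel, well-seen pixel, better-seen pixel, unseen pixel.
front = [[True, False, False, False]]
painted = [[True, True, True, False]]
best_cos = [[1.0, 0.9, 0.3, 0.0]]
cur_cos = [[1.0, 0.5, 0.8, 0.7]]

labels = classify(front, painted, best_cos, cur_cos)
seam = seam_band(labels, fresh=Region.NEW)
print([r.name for r in labels[0]])
print(seam[0])
```

In a real pipeline the seam band would be handed to an inpainting or smoothing tool as its mask, so only the transition between old and new texture is touched and the "always keep" front texture stays intact.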