Human Multi-View Synthesis from a Single-View Model: Transferred Body and Face Representations

Yu Feng, Shunsi Zhang, Jian Shu, Hanfeng Zhao, Guoliang Pang, Chi Zhang, Hao Wang

TL;DR

The paper tackles the challenge of synthesizing consistent multi-view human images from a single input, a task hindered by the scarcity of 3D human data and the difficulty of preserving fine facial details. It introduces a two-stage diffusion-based framework that first transfers body information from a pretrained single-view model and then refines faces by fusing 2D identity priors with 3D facial structure priors, aided by SMPL-based normal maps for coarse geometry. The two key contributions are a transferred body representation, learned via a memory-augmented, cross-domain UNet, and a transferred face representation that integrates 2D and 3D facial features for high-fidelity reconstruction. Evaluations on THuman2.1 and 2K2K show state-of-the-art performance in novel-view synthesis, suggesting practical value for AR/VR, virtual try-on, and 3D content creation with reduced data requirements.

Abstract

Generating multi-view human images from a single view is a complex and significant challenge. Although recent advancements in multi-view object generation have shown impressive results with diffusion models, novel view synthesis for humans remains constrained by the limited availability of 3D human datasets. Consequently, many existing models struggle to produce realistic human body shapes or capture fine-grained facial details accurately. To address these issues, we propose an innovative framework that leverages transferred body and facial representations for multi-view human synthesis. Specifically, we use a single-view model pretrained on a large-scale human dataset to develop a multi-view body representation, aiming to extend the 2D knowledge of the single-view model to a multi-view diffusion model. Additionally, to enhance the model's detail restoration capability, we integrate transferred multimodal facial features into our trained human diffusion model. Experimental evaluations on benchmark datasets demonstrate that our approach outperforms the current state-of-the-art methods, achieving superior performance in multi-view human synthesis.

Paper Structure

This paper contains 16 sections, 5 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Qualitative comparisons with baseline models on synthesized human multi-view images. Our approach generates superior facial details, producing high-quality and consistent novel human views from a single image input.
  • Figure 2: Overall architecture of our method. Our method consists of two learning phases. In the first phase, we leverage a pretrained single-view human model (Li2024CosmicManAT) to learn a transferred body representation. Knowledge transfer is achieved by updating a multi-view human model using pretrained parameters from the single-view human model. In the second phase, we integrate 2D and 3D face features to learn a transferred face representation. The 3D face features provide structure-accurate priors, while the 2D features deliver identity-related priors.
  • Figure 3: Illustration of single-to-multi-view knowledge transfer. The single-view feature map from the input image is expanded to the same dimensions as the multi-view feature map from the previous timestep $t-1$. The multi-view feature map at timestep $t$ is then obtained through an attention operation.
  • Figure 4: Qualitative comparisons between our method and state-of-the-art approaches on the THuman2.1 dataset. The red circle highlights the side-view face area in the synthesized multi-view images, where our method demonstrates superior face detail synthesis.
  • Figure 5: Qualitative results of the two transferred representation learning phases. The first row shows the ground truth, the second row displays results from the transferred body representation learning, and the third row shows results with the full representation.
  • ...and 3 more figures
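
The single-to-multi-view transfer described in the Figure 3 caption (expand the single-view feature map to match the multi-view one, then apply attention) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name, the tensor shapes, the choice of multi-view tokens as queries and single-view tokens as keys and values, and the absence of learned projections are all assumptions made for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def single_to_multi_view_attention(single_feat, multi_feat_prev):
    """Hypothetical sketch of single-to-multi-view feature transfer.

    single_feat:     (N, C) tokens from the single input view.
    multi_feat_prev: (V, N, C) tokens for V target views (previous step).
    Returns an updated (V, N, C) multi-view feature map.
    """
    num_views, num_tokens, channels = multi_feat_prev.shape

    # Expand the single-view map so it has the same shape as the
    # multi-view map (one copy per target view).
    expanded = np.broadcast_to(single_feat[None], (num_views, num_tokens, channels))

    # Assumption: queries come from the multi-view features, keys and
    # values from the expanded single-view features. A real model would
    # apply learned linear projections before this step.
    scores = multi_feat_prev @ expanded.transpose(0, 2, 1) / np.sqrt(channels)
    weights = softmax(scores, axis=-1)  # (V, N, N), each row sums to 1

    # Each output token is a weighted average of single-view tokens.
    return weights @ expanded


rng = np.random.default_rng(0)
single = rng.normal(size=(6, 8))          # 6 tokens, 8 channels
multi_prev = rng.normal(size=(4, 6, 8))   # 4 views
updated = single_to_multi_view_attention(single, multi_prev)
print(updated.shape)
```

Because every output token is a convex combination of single-view tokens, each view can only draw on information present in the input image; in this simplified form, any view-specific content would have to come from other parts of the network.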