Human Multi-View Synthesis from a Single-View Model: Transferred Body and Face Representations
Yu Feng, Shunsi Zhang, Jian Shu, Hanfeng Zhao, Guoliang Pang, Chi Zhang, Hao Wang
TL;DR
The paper tackles the synthesis of consistent multi-view human images from a single input image, a task hindered by scarce 3D human data and the difficulty of preserving fine facial detail. It introduces a two-stage diffusion-based framework that first transfers body information from a pretrained single-view model and then refines faces by fusing 2D identity priors with 3D facial structure priors, with SMPL-based normal maps supplying coarse geometry. The two key contributions are a transferred body representation, built with a memory-augmented, cross-domain UNet, and a transferred face representation that integrates 2D and 3D facial features for high-fidelity reconstruction. Evaluations on THuman2.1 and 2K2K show state-of-the-art novel-view synthesis, which suggests practical value for AR/VR, virtual try-on, and 3D content creation with reduced data requirements.
Abstract
Generating multi-view human images from a single view is a complex and significant challenge. Although recent advancements in multi-view object generation have shown impressive results with diffusion models, novel view synthesis for humans remains constrained by the limited availability of 3D human datasets. Consequently, many existing models struggle to produce realistic human body shapes or to capture fine-grained facial details accurately. To address these issues, we propose a framework that leverages transferred body and facial representations for multi-view human synthesis. Specifically, we use a single-view model pretrained on a large-scale human dataset to develop a multi-view body representation, aiming to extend the 2D knowledge of the single-view model to a multi-view diffusion model. Additionally, to enhance the model's detail restoration capability, we integrate transferred multimodal facial features into our trained human diffusion model. Experimental evaluations on benchmark datasets demonstrate that our approach outperforms current state-of-the-art methods in multi-view human synthesis.
