DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis
Yuming Gu, You Xie, Hongyi Xu, Guoxian Song, Yichun Shi, Di Chang, Jing Yang, Linjie Luo
TL;DR
DiffPortrait3D tackles zero-shot novel-view synthesis of portraits from a single image by extending a frozen 2D latent diffusion backbone with explicitly disentangled appearance and view controls. It introduces an appearance-reference module that injects the reference appearance into self-attention, a ControlNet-inspired view-control module that steers camera pose from a condition image, and a cross-view attention mechanism for view consistency that is reinforced by 3D-aware noise during inference. Trained in stages on real multi-view data and synthetic renders, the method produces 3D-consistent, high-fidelity portrait views without test-time fine-tuning and generalizes to in-the-wild portraits with diverse expressions and styles. Quantitative and qualitative results show state-of-the-art performance on challenging in-the-wild and multi-view benchmarks, with applications in 3D avatars, digital visual effects, and immersive storytelling. The work also discusses limitations, such as occasional flicker in regions not visible in the input, outlines directions for longer-range consistency and multi-source appearance integration, and raises ethical considerations around misuse.
Abstract
We present DiffPortrait3D, a conditional diffusion model that is capable of synthesizing 3D-consistent photo-realistic novel views from as few as a single in-the-wild portrait. Specifically, given a single RGB input, we aim to synthesize plausible but consistent facial details rendered from novel camera views while retaining both identity and facial expression. In lieu of time-consuming optimization and fine-tuning, our zero-shot method generalizes well to arbitrary face portraits with unposed camera views, extreme facial expressions, and diverse artistic depictions. At its core, we leverage the generative prior of 2D diffusion models pre-trained on large-scale image datasets as our rendering backbone, while the denoising is guided with disentangled attentive control of appearance and camera pose. To achieve this, we first inject the appearance context from the reference image into the self-attention layers of the frozen UNets. The rendering view is then manipulated with a novel conditional control module that interprets the camera pose by watching a condition image of a different subject from the same view. Furthermore, we insert a trainable cross-view attention module to enhance view consistency, which is further strengthened with a novel 3D-aware noise generation process during inference. We demonstrate state-of-the-art results both qualitatively and quantitatively on our challenging in-the-wild and multi-view benchmarks.
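To make the idea of injecting appearance context into self-attention more concrete, the following is a minimal illustrative sketch, not the model's actual implementation. It assumes one common way of realizing such injection: the target tokens attend over their own keys and values concatenated with those of the reference image. The function names (`attention`, `reference_self_attention`) are hypothetical, and the learned query/key/value projections of a real UNet are replaced by identity maps for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(dim)])
    return out


def reference_self_attention(target_tokens, reference_tokens):
    """Self-attention in which target tokens also attend to reference tokens.

    Assumption (not from the paper): Q, K, V projections are identity maps,
    and injection is done by concatenating reference tokens to the keys and
    values of the target.
    """
    keys_values = target_tokens + reference_tokens
    return attention(target_tokens, keys_values, keys_values)


# Toy usage: two target tokens and one reference token, feature size 2.
target = [[1.0, 0.0], [0.0, 1.0]]
reference = [[5.0, 5.0]]
mixed = reference_self_attention(target, reference)
plain = attention(target, target, target)
```

In this toy setting the output keeps the shape of the target tokens, while each output vector becomes a weighted mix of target and reference features. With an empty reference list the function reduces to ordinary self-attention.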
