DiffPortrait3D: Controllable Diffusion for Zero-Shot Portrait View Synthesis

Yuming Gu, You Xie, Hongyi Xu, Guoxian Song, Yichun Shi, Di Chang, Jing Yang, Linjie Luo

TL;DR

DiffPortrait3D tackles zero-shot portrait novel-view synthesis from a single image by extending a frozen 2D latent diffusion backbone with explicitly disentangled appearance and view controls. It introduces an appearance-reference module that injects reference appearance into self-attention, a ControlNet-inspired view-control module that steers camera pose from a condition image, and a cross-view attention-based view-consistency mechanism reinforced by 3D-aware noise during inference. Trained in a staged fashion on real multi-view data and synthetic renders, the method produces 3D-consistent, high-fidelity portrait views without runtime fine-tuning and generalizes to in-the-wild portraits with diverse expressions and styles. Quantitative and qualitative results show state-of-the-art performance on challenging benchmarks, with robust reconstruction and plausible multi-view synthesis, pointing to practical impact in 3D avatars, digital visual effects, and immersive storytelling. The work also discusses limitations such as occasional flicker in unseen regions, outlines directions for longer-range consistency and multi-source appearance integration, and raises ethical considerations around misuse.

Abstract

We present DiffPortrait3D, a conditional diffusion model capable of synthesizing 3D-consistent, photo-realistic novel views from as few as a single in-the-wild portrait. Specifically, given a single RGB input, we aim to synthesize plausible but consistent facial details rendered from novel camera views while retaining both identity and facial expression. In lieu of time-consuming optimization and fine-tuning, our zero-shot method generalizes well to arbitrary face portraits with unposed camera views, extreme facial expressions, and diverse artistic depictions. At its core, we leverage the generative prior of 2D diffusion models pre-trained on large-scale image datasets as our rendering backbone, while the denoising is guided with disentangled attentive control of appearance and camera pose. To achieve this, we first inject the appearance context from the reference image into the self-attention layers of the frozen UNets. The rendering view is then manipulated with a novel conditional control module that interprets the camera pose by watching a condition image of a crossed subject from the same view. Furthermore, we insert a trainable cross-view attention module to enhance view consistency, which is further strengthened with a novel 3D-aware noise generation process during inference. We demonstrate state-of-the-art results both qualitatively and quantitatively on our challenging in-the-wild and multi-view benchmarks.
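The appearance injection described in the abstract can be illustrated with a minimal NumPy sketch. It assumes, following the Figure 2(b) caption, that reference features are concatenated with the target features to form the keys and values of a self-attention layer, while the queries come from the target alone. The function name `reference_self_attention`, the single-head layout, the random projection matrices, and the token counts are all illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def reference_self_attention(target_tokens, ref_tokens, w_q, w_k, w_v):
    """Single-head self-attention whose keys/values also see reference tokens.

    target_tokens: (N, C) spatial features of the view being denoised.
    ref_tokens:    (M, C) spatial features extracted from the reference image.
    All names and shapes here are hypothetical, chosen for illustration.
    """
    # Concatenate along the token axis so each target token can attend
    # to both its own feature map and the reference appearance.
    context = np.concatenate([target_tokens, ref_tokens], axis=0)  # (N + M, C)
    q = target_tokens @ w_q  # queries come from the target only
    k = context @ w_k
    v = context @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (N, N + M)
    weights = softmax(scores, axis=-1)
    return weights @ v, weights


rng = np.random.default_rng(0)
N, M, C = 16, 16, 8
target = rng.normal(size=(N, C))
ref = rng.normal(size=(M, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) / np.sqrt(C) for _ in range(3))

out, attn = reference_self_attention(target, ref, w_q, w_k, w_v)
print(out.shape, attn.shape)
```

The output keeps the target's token count, so a layer written this way could in principle replace a standard self-attention layer without changing downstream shapes. Only the attention context grows, from N to N + M tokens.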

Paper Structure (36 sections, 5 equations, 18 figures, 3 tables)

This paper contains 36 sections, 5 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Given a single portrait as reference (left), DiffPortrait3D produces high-fidelity, 3D-consistent novel views (right). Notably, without any fine-tuning, DiffPortrait3D is effective across a diverse range of facial portraits, including but not limited to faces with exaggerated expressions, wide camera views, and artistic depictions.
  • Figure 2: (a) Overview of our DiffPortrait3D framework. Given a single reference image $I_{ref}$, we aim to synthesize its novel views $I_{T}$ at camera perspectives aligned with the condition images $I_{cam}$. We leverage a pre-trained LDM $\mathcal{F}$ as our image synthesis backbone (middle), whose self-attention layers cross-query the appearance context from $I_{ref}$ via our appearance reference module $\mathcal{F}_{ref}$ (right). Our view control module $\mathcal{F}_{cam}$ (left) derives an additive view condition from $I_{cam}$ and applies it to $\mathcal{F}$. Additionally, we plug view consistency modules (dotted rectangles, middle) into $\mathcal{F}$ to enhance multi-view coherence. During training, the images $I_{cam}$ are rendered using an off-the-shelf 3D GAN renderer $R$, with camera perspectives aligned with $I_{T}$. (b) The intermediate spatial features $\varphi(\cdot)$ sourced from $I_{ref}$ are concatenated into the corresponding self-attention blocks in $\mathcal{F}$. (c) Our view-consistency module applies an attention mechanism across the multi-view dimension.
  • Figure 3: Qualitative comparison of novel view synthesis on in-the-wild images. Compared to the baselines, our method shows superior generalization capability to novel view synthesis of wild portraits with unseen appearances, expressions and styles, even without any reliance on fine-tuning.
  • Figure 4: Qualitative comparison of novel view synthesis on NeRSemble (Kirschstein et al., 2023). Our method achieves effective view control for novel synthesis with the best perceptual quality and retained identity and expression, even for portraits with exaggerated expressions and under substantial changes of camera view.
  • Figure 5: Ablation on view consistency. Excessive background variation and slight shading change across multiple novel views are observable without our view-consistency module. Our 3D-aware noise, compared to random Gaussian noise, helps maintain structural coherence during view animation.
  • ...and 13 more figures
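
The view-consistency idea in the Figure 2(c) caption, attention applied across the multi-view dimension, can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: each spatial location attends to the same location in the other views, and the result is added back residually. The name `cross_view_attention`, the single-head layout, the residual connection, and the tensor shapes are hypothetical and not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_view_attention(features, w_q, w_k, w_v):
    """Attention over the view axis, independently per spatial location.

    features: (V, N, C) with V views, N spatial tokens, C channels.
    Assumption: a residual connection wraps the attention output.
    """
    # Move the view axis next to channels so attention mixes views,
    # not spatial positions: (V, N, C) -> (N, V, C).
    x = np.transpose(features, (1, 0, 2))
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ np.transpose(k, (0, 2, 1)) / np.sqrt(q.shape[-1])  # (N, V, V)
    weights = softmax(scores, axis=-1)
    mixed = weights @ v  # (N, V, C)
    # Restore the original layout and add the residual.
    return features + np.transpose(mixed, (1, 0, 2))


rng = np.random.default_rng(0)
V, N, C = 4, 16, 8
feats = rng.normal(size=(V, N, C))
w_q, w_k, w_v = (rng.normal(size=(C, C)) / np.sqrt(C) for _ in range(3))

out = cross_view_attention(feats, w_q, w_k, w_v)
print(out.shape)
```

With a zero value projection the residual path returns the input unchanged. That is one common way to add a trainable module to a frozen backbone without disturbing its initial behavior, though whether the authors initialize their module this way is not stated in the text above.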