3D-SSGAN: Lifting 2D Semantics for 3D-Aware Compositional Portrait Synthesis

Ruiqi Liu; Peng Zheng; Ye Wang; Rui Ma

TL;DR

3D-SSGAN addresses the need for fine-grained, part-level control in 3D-aware portrait synthesis. It uses a depth-guided 2D-to-3D lifting module to obtain 3D part features from per-part 2D generators, and a 3D-aware semantic mask renderer within a NeRF-based volume rendering framework to produce both images and semantic masks. The model is trained end-to-end with adversarial losses on images and masks, achieving strong semantic disentanglement and robust consistency across viewpoints. Quantitative and qualitative results show better part-level editability than state-of-the-art 3D-aware methods and image quality competitive with both 3D-aware and 2D semantic methods, with notable efficiency advantages over per-part 3D generators. The approach enables controllable, 3D-consistent portrait composition and editing, with potential applications in progressive generation and part-level style interpolation.

Abstract

Existing 3D-aware portrait synthesis methods can generate impressive high-quality images while preserving strong 3D consistency. However, most of them do not support fine-grained part-level control over the synthesized images. Conversely, some GAN-based 2D portrait synthesis methods achieve clear disentanglement of facial regions, but they cannot preserve view consistency because they lack 3D modeling abilities. To address these issues, we propose 3D-SSGAN, a novel framework for 3D-aware compositional portrait image synthesis. First, a simple yet effective depth-guided 2D-to-3D lifting module maps the generated 2D part features and semantics to 3D. Then, a volume renderer with a novel 3D-aware semantic mask renderer produces the composed face features and corresponding masks. The whole framework is trained end-to-end by discriminating between real and synthesized 2D images and their semantic masks. Quantitative and qualitative evaluations demonstrate the superiority of 3D-SSGAN in controllable part-level synthesis while preserving 3D view consistency.
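To make the compositing step more concrete, the following is a minimal NumPy sketch of how per-part densities and features might be combined along a single ray to give both a rendered feature and a soft per-part mask. It is not the paper's implementation: the density-weighted feature averaging, the way the mask is accumulated, and the name `composite_ray` are all assumptions chosen for illustration. Only the alpha-compositing quadrature itself is the standard NeRF formulation.

```python
import numpy as np

def composite_ray(sigmas, feats, deltas):
    """Composite K semantic parts along one ray (illustrative sketch).

    sigmas: (K, S) non-negative per-part densities at S ray samples
    feats:  (K, S, C) per-part features at the same samples
    deltas: (S,) distances between adjacent samples
    Returns a rendered feature of shape (C,) and a soft part mask of shape (K,).
    """
    sigma = sigmas.sum(axis=0)                        # (S,) total density
    # Assumption: each part contributes in proportion to its share of density.
    share = sigmas / np.maximum(sigma, 1e-8)          # (K, S)
    feat = (share[..., None] * feats).sum(axis=0)     # (S, C) blended feature

    # Standard NeRF quadrature: alpha, transmittance, per-sample weight.
    alpha = 1.0 - np.exp(-sigma * deltas)             # (S,)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = trans * alpha                           # (S,)

    rendered = (weights[:, None] * feat).sum(axis=0)  # (C,)
    # Assumption: a part's mask value is the rendering weight it accounts for.
    mask = (share * weights).sum(axis=1)              # (K,)
    return rendered, mask
```

Under these assumptions the mask values of all parts sum to the accumulated opacity of the ray, so a part with zero density everywhere receives a mask value of zero.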

Paper Structure (12 sections, 14 equations, 10 figures, 3 tables)

Introduction
Related Work
Method
Preliminaries
Feature Generation
Volume Rendering and Image Synthesis
Training and Implementation Details
Results
Evaluation
Ablation Studies
Application
Conclusion

Figures (10)

Figure 1: 3D-aware part-level synthesis results of 3D-SSGAN. Semantic parts in the target images are transferred onto the source image by using the same latent code that generates the target part. The images generated from other views and the synthesized masks are shown in the last two rows.
Figure 2: Overview of our training pipeline. First, a depth-guided 2D-to-3D lifting module converts 2D part feature $f_k^{2d}$ and density $\sigma_k^{2d}$ generated by independent 2D generators to 3D. Then, a posed 2D feature $f_{render}$ and an initial mask $m$ are synthesized by a volume renderer which contains a 3D-aware semantic mask renderer. Finally, high-resolution images and masks generated by a render net $R$ are passed to a discriminator $D$ for adversarial training. The green arrows show the data flow for the 3D-aware mask renderer.
Figure 3: The architecture of our 2D generator. The Fourier features are 64-channel constant vectors, which work as position embedding. The blue blocks are $1\times 1$ convolution layers with 64 channels. The green blocks are linear transformation layers.
Figure 4: For each image group, part-level editing of the mouth region is applied to the left image generated by one portrait synthesis method, and the result is shown on the right. The difference maps, which encode the averaged difference of the R, G, B channels, are visualized: the left map is the direct difference between the two portrait images, and the right map is obtained by removing the mouth region using the corresponding part mask. The mean value of the whole difference map, i.e., $D_{mean}$ or $D_{mean}^{masked}$, is shown in the top right corner. It can be seen that our method outperforms the 3D-aware synthesis methods (IDE-3D, NeRFFaceEditing, CNeRF) in disentangled part editing, while approaching the performance of the 2D-only synthesis method SemanticStyleGAN.
Figure 5: More comparisons with CNeRF in semantic-disentangled part-level editing. The difference maps, which reveal the influence on other regions, are computed in the same way as in Fig. 4. It can be seen that our method has a smaller influence on the non-edited regions and that this advantage is consistent across views.
...and 5 more figures
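The difference-map metrics described in the Figure 4 caption can be sketched as follows. This is a minimal reading of the caption, not the authors' evaluation code: the use of the absolute difference, the [0, 1] value range, averaging the masked map over all pixels, and the name `difference_maps` are assumptions.

```python
import numpy as np

def difference_maps(original, edited, part_mask):
    """Compute (D_mean, D_mean_masked) for one before/after image pair.

    original, edited: (H, W, 3) float arrays, assumed to lie in [0, 1]
    part_mask: (H, W) bool array, True inside the edited part (e.g. the mouth)
    """
    # Average the per-channel absolute difference over R, G, B.
    diff = np.abs(original - edited).mean(axis=2)    # (H, W)
    # Remove the edited region so only unintended changes remain.
    masked = np.where(part_mask, 0.0, diff)          # (H, W)
    return float(diff.mean()), float(masked.mean())
```

A lower masked value indicates that the edit leaked less into regions outside the edited part, which is the sense in which the caption compares the methods.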
