NaviNeRF: NeRF-based 3D Representation Disentanglement by Latent Semantic Navigation

Baao Xie; Bohan Li; Zequn Zhang; Junting Dong; Xin Jin; Jingyu Yang; Wenjun Zeng

TL;DR

NaviNeRF is the first work to achieve fine-grained 3D disentanglement without any priors or supervision, and it disentangles fine-grained 3D attributes better than previous 3D-aware models.

Abstract

3D representation disentanglement aims to identify, decompose, and manipulate the underlying explanatory factors of 3D data, which helps AI fundamentally understand our 3D world. This task is currently under-explored and poses great challenges: (i) 3D representations are complex and in general contain much more information than 2D images; (ii) many 3D representations are not well suited for gradient-based optimization, let alone disentanglement. To address these challenges, we use NeRF as a differentiable 3D representation, and introduce a self-supervised Navigation to identify interpretable semantic directions in the latent space. To the best of our knowledge, this novel method, dubbed NaviNeRF, is the first work to achieve fine-grained 3D disentanglement without any priors or supervision. Specifically, NaviNeRF is built upon the generative NeRF pipeline and is equipped with an Outer Navigation Branch and an Inner Refinement Branch. The two are complementary: the outer navigation identifies global-view semantic directions, and the inner refinement is dedicated to fine-grained attributes. A synergistic loss is further devised to coordinate the two branches. Extensive experiments demonstrate that NaviNeRF achieves better fine-grained 3D disentanglement than previous 3D-aware models. Its performance is also comparable to that of editing-oriented models relying on semantic or geometry priors.

Paper Structure (14 sections, 7 equations, 9 figures, 3 tables)

Introduction
Related Works
Methodology
Outer Navigation Branch
Inner Refinement Branch
Loss Functions
Experiments
Experimental Settings
Results
Fine-grained 3D Disentanglement
Qualitative Comparison
Quantitative Comparison
Ablation study
Conclusion

Figures (9)

Figure 1: 3D objects generated by NaviNeRF, a model that aims to achieve fine-grained 3D disentanglement by bridging 3D reconstruction and latent semantic manipulation. The top row presents the results of shifting along a learned semantic direction that represents continuous changes in a man's mouth, which visually resembles a "smile" expression. The bottom row showcases the results of multi-view generation, demonstrating that the attribute manipulation remains consistent across different views.
Figure 2: Workflows of standard conditional NeRFs and NaviNeRF. NaviNeRF combines an outer navigation branch and an inner refinement branch through a synergistic loss for fine-grained 3D disentanglement. Compared with existing solutions, NaviNeRF does not require conditional latent codes or semantic/geometric priors.
Figure 3: Within a latent space $\mathcal{Z}$, the model discovers and manipulates an interpretable semantic direction from the original code $z$ to the shifted code $z_s$. Traversing along this direction leads to continuous changes in a disentangled representation of the generated image.
Figure 4: NaviNeRF is characterized by two complementary branches, termed the outer navigation branch and the inner refinement branch. The former, depicted in green, applies a shift to the sampled latent code $z$ through a learnable matrix $S$. $z$ and the shifted code $z_s$ are used to generate paired images, which are used to train the decoder $D$ for semantic direction identification. The latter, shown in orange, provides fine-grained awareness and 3D consistency by applying shifts to specific dimensions of the intermediate latent vector $w$. The two branches are combined by a synergistic loss, ultimately achieving feature-level 3D disentanglement.
Figure 5: Fine-grained 3D disentanglement results of NaviNeRF. The left columns present the results of attribute manipulation and the right columns show the corresponding 3D reconstruction results. (a) shows semantic manipulation on the FFHQ dataset, including the man's mouth, whiskers and the girl's hair; (b) shows manipulation results on the puppy's ears, tongue and cheeks.
...and 4 more figures
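
The latent shifts described in the captions of Figures 3 and 4 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the captions do not state how the matrix $S$ is applied, so the sketch assumes the simplest reading, $z_s = z + \epsilon \, S_k$, where $S_k$ is the $k$-th candidate direction and $\epsilon$ is a scalar shift magnitude. The function names `shift_latent`, `shift_dimensions` and `traverse`, the fixed toy directions, and the omission of the generator and the decoder $D$ are all assumptions made for illustration.

```python
import random


def shift_latent(z, directions, k, eps):
    """Outer-branch style shift (assumed form): z_s = z + eps * directions[k]."""
    d = directions[k]
    if len(d) != len(z):
        raise ValueError("direction and latent code must have the same length")
    return [zi + eps * di for zi, di in zip(z, d)]


def shift_dimensions(w, dims, eps):
    """Inner-branch style shift (assumed form): add eps only to chosen dimensions of w."""
    chosen = set(dims)
    return [wi + eps if i in chosen else wi for i, wi in enumerate(w)]


def traverse(z, directions, k, steps):
    """Latent codes obtained by walking along direction k for each magnitude in steps."""
    return [shift_latent(z, directions, k, eps) for eps in steps]


rng = random.Random(0)
latent_dim = 4

# Sampled latent code z (toy dimensionality, hypothetical).
z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]

# Stand-in for the learnable matrix S: one row per candidate direction.
# In training these entries would be learned; here they are fixed by hand.
S = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]

z_s = shift_latent(z, S, k=0, eps=2.0)            # shifted code z_s
w_s = shift_dimensions([0.0] * 4, [1, 3], 0.5)    # shift selected dimensions of w
path = traverse(z, S, k=1, steps=[-1.0, 0.0, 1.0])
```

In a full pipeline, each pair $(z, z_s)$ would be rendered by the generator and the decoder $D$ would be trained to recover the direction index from the image pair; that part depends on details not given in the captions and is left out here.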
