GenN2N: Generative NeRF2NeRF Translation

Xiangyue Liu, Han Xue, Kunming Luo, Ping Tan, Li Yi

TL;DR

GenN2N tackles universal NeRF editing by translating 2D image edits into 3D NeRF space. It combines a plug-and-play 2D image-to-image translator with a conditional 3D VAE-GAN that models the distribution of possible 3D edits via a latent code $z$ drawn from a Gaussian, and it enforces 3D consistency through a differentiable volume renderer coupled with reconstruction, adversarial, and contrastive losses. The approach supports text-driven editing, colorization, super-resolution, and inpainting, delivering diverse, multi-view-consistent results that are competitive with task-specific baselines. Because different 2D editors can be plugged in and diverse 3D edits can be sampled at inference, the framework is flexible and performs well across a variety of scenes and tasks.

Abstract

We present GenN2N, a unified NeRF-to-NeRF translation framework for various NeRF translation tasks such as text-driven NeRF editing, colorization, super-resolution, inpainting, etc. Unlike previous methods designed for individual translation tasks with task-specific schemes, GenN2N achieves all these NeRF editing tasks by employing a plug-and-play image-to-image translator to perform editing in the 2D domain and lifting 2D edits into the 3D NeRF space. Since the 3D consistency of 2D edits may not be assured, we propose to model the distribution of the underlying 3D edits through a generative model that can cover all possible edited NeRFs. To model the distribution of 3D edited NeRFs from 2D edited images, we carefully design a VAE-GAN that encodes images while decoding NeRFs. The latent space is trained to align with a Gaussian distribution and the NeRFs are supervised through an adversarial loss on their renderings. To ensure the latent code does not depend on 2D viewpoints but truly reflects the 3D edits, we also regularize the latent code through a contrastive learning scheme. Extensive experiments on various editing tasks show GenN2N, as a universal framework, performs as well as or better than task-specific specialists while possessing flexible generative power. More results are available on our project page: https://xiangyueliu.github.io/GenN2N/
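
As a rough illustration of what aligning the latent space with a Gaussian distribution can look like, the sketch below uses the standard VAE closed-form KL term against a unit Gaussian, together with reparameterized sampling. It is an assumption-laden sketch, not the authors' implementation: the function names (`kl_to_standard_normal`, `sample_latent`, `sample_prior`), the diagonal-Gaussian posterior, and the `log_var` parameterization are all hypothetical choices made here for illustration.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) ).

    Assumption: the encoder predicts a diagonal Gaussian per edited image.
    """
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv for m, lv in zip(mu, log_var))

def sample_latent(mu, log_var, rng):
    """Reparameterized sample z = mu + sigma * eps, with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0) for m, lv in zip(mu, log_var)]

def sample_prior(dim, rng):
    """Draw an edit code from the unit-Gaussian prior, as one would at inference."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

if __name__ == "__main__":
    rng = random.Random(0)
    mu, log_var = [0.3, -0.1], [0.0, -0.5]
    print("KL term:", kl_to_standard_normal(mu, log_var))
    print("training-time code:", sample_latent(mu, log_var, rng))
    print("inference-time code:", sample_prior(2, rng))
```

The KL term is zero exactly when the predicted distribution already equals the unit Gaussian, which is what makes sampling from the prior at inference meaningful.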

Paper Structure

This paper contains 27 sections, 6 equations, 20 figures, and 8 tables.

Figures (20)

  • Figure 1: We introduce GenN2N, a unified framework for NeRF-to-NeRF translation, enabling a range of 3D NeRF editing tasks, including text-driven editing, colorization, super-resolution, inpainting, etc. We show at least two rendered views of each edited NeRF scene at inference time. Given a 3D NeRF scene, GenN2N can produce high-quality editing results with multi-view consistency.
  • Figure 2: Overview of GenN2N. We first edit the source image set $\{ \mathbf{I}_i\}^{N-1}_{i=0}$ using 2D image-to-image translation methods, e.g., text-driven editing, colorization, zoom out, etc. For each view $i\in [0,N-1]$, we generate $M$ edited images, resulting in a set of translated images $\{ \{ \mathbf{S}^j_i\}^{M-1}_{j=0} \}^{N-1}_{i=0}$. Then we use the Latent Distill Module to learn $M \times N$ edit code vectors from the translated image set, which serve as the input to the translated NeRF. To optimize GenN2N, we design four loss functions: a KL loss to constrain the latent vectors to a Gaussian distribution, and $\mathcal{L}_{\textrm{recon}}$, $\mathcal{L}_{\textrm{adv}}$ and $\mathcal{L}_{\textrm{contr}}$ to optimize the appearance and geometry of the translated NeRF. At inference, we can sample a latent vector $\mathbf{z}$ from a Gaussian distribution and render a corresponding high-quality, multi-view-consistent 3D scene.
  • Figure 3: Illustration of our proposed contrastive loss functions. The multi-view rendered images $\mathbf{C}_i^j$ and $\mathbf{C}_l^j$, which share the same edit code, are fed back into our Latent Distill Module to extract ${\mathbf{z}}_i^j$ and ${\mathbf{z}}_l^j$, and these codes are pulled together via $\mathcal{L}_{\textrm{contr}}^{\textrm{att}}$. In addition, for $\mathbf{S}_i^k$, whose editing style differs from that of $\mathbf{S}_i^j$, $\mathcal{L}_{\textrm{contr}}^{\textrm{rep}}$ increases the distance between their edit codes.
  • Figure 4: Illustration of our proposed conditional adversarial loss functions. Our conditional discriminator detects artifacts such as blur and distortion in the novel-view rendered image $\mathbf{C}_l^j$ compared with the target image $\mathbf{S}_l^j$. $\mathbf{S}_l^j$ and $\mathbf{S}_l^k$ are edited from the same view but in different styles; the latter serves as the condition and is concatenated with $\mathbf{C}_l^j$ and $\mathbf{S}_l^j$ to form the fake and real pairs, respectively.
  • Figure 5: Text-Driven Editing. We sample four inference results for each of the two text-driven editing tasks. The diversity in geometry and appearance showcases the strong generative ability of GenN2N while maintaining 3D consistency across viewpoints.
  • ...and 15 more figures
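
The attract/repel behavior described in the Figure 3 caption can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the exact form of $\mathcal{L}_{\textrm{contr}}^{\textrm{att}}$ and $\mathcal{L}_{\textrm{contr}}^{\textrm{rep}}$ is not given in the captions, so the Euclidean distance, the hinge with a `margin` parameter, and the function names below are hypothetical stand-ins rather than the paper's definitions.

```python
import math

def l2_distance(a, b):
    """Euclidean distance between two edit codes given as lists of floats."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def contrastive_loss(z_view_i, z_view_l, z_other_edit, margin=1.0):
    """Toy attract/repel objective on edit codes.

    z_view_i, z_view_l: codes extracted from two rendered views that share
        the same edit (they should coincide, so their distance is penalized).
    z_other_edit: code of an image from the same view but a different edit
        style (it should be at least `margin` away; `margin` is an assumption).
    """
    attract = l2_distance(z_view_i, z_view_l)
    repel = max(0.0, margin - l2_distance(z_view_i, z_other_edit))
    return attract + repel

if __name__ == "__main__":
    same_edit_a = [0.2, 0.1]
    same_edit_b = [0.25, 0.05]
    different_edit = [1.5, -0.7]
    print(contrastive_loss(same_edit_a, same_edit_b, different_edit))
```

The attract term makes the code independent of the viewpoint, while the repel term keeps codes for distinct edits from collapsing onto one another.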