Bringing NeRFs to the Latent Space: Inverse Graphics Autoencoder

Antoine Schnepf, Karim Kassab, Jean-Yves Franceschi, Laurent Caraffa, Flavian Vasile, Jeremie Mary, Andrew Comport, Valerie Gouet-Brunet

TL;DR

This work enables NeRFs to operate directly in the latent space of an image autoencoder by introducing IG-AE, an autoencoder whose latent space is made 3D-aware by regularizing it with 3D geometry learned from synthetic scenes. It presents a two-stage latent NeRF training pipeline (Latent Supervision, then RGB Alignment) built on this autoencoder, which is trained to preserve reconstruction quality on both synthetic and real data. The approach models 3D scenes with Tri-Planes and enforces 3D consistency in the latent space while aligning decoded renderings with RGB views. Latent NeRFs trained with IG-AE achieve higher quality than those trained with a standard AE, and they train and render faster than RGB-space NeRFs. An open-source Nerfstudio extension lets researchers train the NeRF methods it supports in the latent space, which also makes them interoperable with other latent-based 2D methods.

Abstract

While pre-trained image autoencoders are increasingly utilized in computer vision, the application of inverse graphics in 2D latent spaces has been under-explored. Yet, besides reducing the training and rendering complexity, applying inverse graphics in the latent space enables a valuable interoperability with other latent-based 2D methods. The major challenge is that inverse graphics cannot be directly applied to such image latent spaces because they lack an underlying 3D geometry. In this paper, we propose an Inverse Graphics Autoencoder (IG-AE) that specifically addresses this issue. To this end, we regularize an image autoencoder with 3D-geometry by aligning its latent space with jointly trained latent 3D scenes. We utilize the trained IG-AE to bring NeRFs to the latent space with a latent NeRF training pipeline, which we implement in an open-source extension of the Nerfstudio framework, thereby unlocking latent scene learning for its supported methods. We experimentally confirm that Latent NeRFs trained with IG-AE present an improved quality compared to a standard autoencoder, all while exhibiting training and rendering accelerations with respect to NeRFs trained in the image space. Our project page can be found at https://ig-ae.github.io .

Paper Structure

This paper contains 36 sections, 14 equations, 14 figures, 16 tables.

Figures (14)

  • Figure 1: 3D-aware latent space. We draw inspiration from the relationship between the 3D space and image space and introduce the concept of a 3D-aware latent space. We propose an Inverse Graphics Autoencoder (IG-AE) that encodes images into 3D-aware latent images, hence preserving 3D-consistency. We use these latents to train scene representations in the 3D-aware latent space.
  • Figure 2: Comparison of IG-AE and a standard AE. Encoding 3D-consistent images using an AE leads to 3D-inconsistent latent images. When trained on such latents, NeRF renderings present artifacts when decoded. IG-AE presents a 3D-aware latent space with 3D-consistent latent images. Latent NeRFs trained with IG-AE eliminate these artifacts and more closely match the ground truth.
  • Figure 3: Latent NeRF Training. We train a Latent NeRF in two stages. First, we train the chosen NeRF method $F_\theta$ on posed encoded latent images using its own loss $\mathcal{L}_{F_\theta}$, which matches rendered latents $\tilde{z}_p$ and encoded latents $z_p$. Subsequently, we align with the scene in the RGB space by adding decoder fine-tuning via $\mathcal{L}_\mathrm{align}$, which matches ground truth images $x_p$ and decoded renderings $\tilde{x}_p$.
  • Figure 4: IG-AE Training. We jointly learn a set of latent synthetic scenes $\mathcal{T}_\tau$ and supervise the latent images $z_{s,p}$ of an autoencoder with rendered 3D-consistent latents $\tilde{z}_{s,p}$ using $\mathcal{L}_\mathrm{latent}$. We match decoded latent renderings $\tilde{x}_{s,p}$ with the ground truth scene renderings $x_{s,p}$ using $\mathcal{L}_\mathrm{RGB}$. We preserve autoencoder performance on synthetic and real data through $\mathcal{L}_\mathrm{ae}^\mathrm{(synth)}$ and $\mathcal{L}_\mathrm{ae}^\mathrm{(real)}$, respectively.
  • Figure 5: Qualitative results. Visualization of decoded latent NeRF renderings trained with a standard AE and an IG-AE on scenes from three out-of-distribution datasets. Latent NeRFs trained with an AE exhibit artifacts in decoded renderings that are not present in those trained with IG-AE.
  • ...and 9 more figures
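
The Figure 4 caption names four loss terms for IG-AE training. The snippet below is a minimal sketch of how such terms could be computed and combined. It is not the authors' implementation. It assumes a mean squared error for every term and a weighted sum as the combination rule, neither of which is stated in the caption. The helper names (`mse`, `ig_ae_loss_terms`, `total_loss`) and the weights are hypothetical, and flat Python lists stand in for latent and image tensors.

```python
def mse(a, b):
    """Mean squared error between two equal-length sequences.

    Assumed stand-in for every loss term; the caption does not give their form.
    """
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def ig_ae_loss_terms(z_enc, z_render, x_dec_render, x_synth, x_synth_recon,
                     x_real, x_real_recon):
    """Return the four terms named in the Figure 4 caption (hypothetical helper).

    z_enc          encoded latent image of a synthetic view
    z_render       3D-consistent latent rendered from the learned latent scene
    x_dec_render   decoded latent rendering
    x_synth        ground truth synthetic view
    x_synth_recon  autoencoder reconstruction of the synthetic view
    x_real         real image
    x_real_recon   autoencoder reconstruction of the real image
    """
    return {
        "latent": mse(z_enc, z_render),           # L_latent
        "rgb": mse(x_dec_render, x_synth),        # L_RGB
        "ae_synth": mse(x_synth_recon, x_synth),  # L_ae^(synth)
        "ae_real": mse(x_real_recon, x_real),     # L_ae^(real)
    }


def total_loss(terms, weights=None):
    """Weighted sum of the terms; weights not supplied default to 1.0 (assumption)."""
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in terms.items())


# Toy flattened values standing in for real tensors.
terms = ig_ae_loss_terms(
    z_enc=[0.1, 0.4], z_render=[0.2, 0.3],
    x_dec_render=[0.5, 0.5], x_synth=[0.6, 0.4],
    x_synth_recon=[0.6, 0.5],
    x_real=[0.9, 0.1], x_real_recon=[0.8, 0.2],
)
print(terms)
print(total_loss(terms, weights={"latent": 0.5}))  # hypothetical weight
```

Keeping the terms in a dictionary makes it easy to log each one separately and to try different weightings without touching the term definitions.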