Exploring 3D-aware Latent Spaces for Efficiently Learning Numerous Scenes

Antoine Schnepf, Karim Kassab, Jean-Yves Franceschi, Laurent Caraffa, Flavian Vasile, Jeremie Mary, Andrew Comport, Valérie Gouet-Brunet

TL;DR

The paper tackles scaling neural scene representations to large sets of semantically similar scenes by introducing a 3D-aware latent space, learned by a 3D-aware autoencoder (3Da-AE), together with cross-scene information sharing. It trains Tri-Plane representations in that latent space using the Encode-Scene, Decode-Scene, and Encode-Decode-Scene strategies, and uses a Micro-Macro decomposition to share common information across scenes, which reduces per-scene memory and training time while maintaining rendering quality. The approach has two stages: it first trains the 3D-aware autoencoder to shape the latent space, then exploits that space to learn many scenes efficiently. When training 1000 scenes, it reduces per-scene training time by 86% and per-scene memory by 44%, with PSNR comparable to RGB-based Tri-Planes and a 53% reduction in rendering time. The work offers a practical pathway toward a foundation 3D-aware latent space for scalable 3D scene learning and rendering.

Abstract

We present a method enabling the scaling of NeRFs to learn a large number of semantically-similar scenes. We combine two techniques to improve the required training time and memory cost per scene. First, we learn a 3D-aware latent space in which we train Tri-Plane scene representations, hence reducing the resolution at which scenes are learned. Moreover, we present a way to share common information across scenes, hence allowing for a reduction of model complexity to learn a particular scene. Our method reduces effective per-scene memory costs by 44% and per-scene time costs by 86% when training 1000 scenes. Our project page can be found at https://3da-ae.github.io .

Paper Structure (29 sections, 11 equations, 8 figures, 2 tables, 2 algorithms)

This paper contains 29 sections, 11 equations, 8 figures, 2 tables, 2 algorithms.

Figures (8)

  • Figure 1: 3D-aware latent space. We draw inspiration from the relationship between the 3D space and image space and introduce the idea of a 3D latent space. We propose a 3D-aware autoencoder that encodes images into a 3D-aware (2D) latent image space, in which we train our scene representations.
  • Figure 2: Methods for learning scenes in a 3D-aware latent space. Diagrams for (a) Encode-Scene, (b) Decode-Scene, and (c) Encode-Decode-Scene, the proposed methods to train Tri-Plane scene representations in a 3D-aware latent space.
  • Figure 3: Latent space comparison. Top: ground truth image. Middle: latent image obtained with the 3D-aware encoder. Bottom: latent image obtained with the baseline encoder. Qualitative results show that our 3D-aware encoder better preserves 3D consistency and geometry in the latent space.
  • Figure 4: Latent scenes comparison. Visualization of Tri-Planes renderings and their corresponding decodings after learning scenes in the latent space of a standard AE and our 3D-aware AE. All Tri-Planes are trained using the Encode-Scene pipeline.
  • Figure 5: 3Da-AE training. We learn a 3D-aware latent space by regularizing its training with 3D constraints. To this end, we jointly train the encoder $E_\phi$, the decoder $D_\psi$ and $N$ scenes in this latent space. For each scene $s$, we learn a Tri-Planes representation $T_{s}$, built from the concatenation of local Tri-Planes $T_s^{mic}$ and global Tri-Planes $T_s^{mac}$. $T_s^{mic}$ is retrieved via a one-hot vector $e_s$ from a set of scene-specific planes stored in memory. $T_s^{mac}$ is computed from a summation of $M$ globally shared Tri-Planes, weighted with weights $W_s$.
  • ...and 3 more figures
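
The Micro-Macro decomposition described in the Figure 5 caption can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: the function name `build_triplane`, the array layouts, the feature counts, and the choice to concatenate along the feature axis are all assumptions made for the example. It shows how $T_s^{mic}$ is selected with the one-hot vector $e_s$, how $T_s^{mac}$ is formed as a sum of the $M$ shared Tri-Planes weighted by $W_s$, and how the two are concatenated into $T_s$.

```python
import numpy as np

def build_triplane(scene_idx, micro_planes, macro_bases, weights):
    """Assemble T_s from a scene-specific part and a globally shared part.

    Assumed (hypothetical) layouts:
      micro_planes: (N, 3, F_mic, H, W)  scene-specific planes stored in memory
      macro_bases:  (M, 3, F_mac, H, W)  M globally shared Tri-Planes
      weights:      (N, M)               per-scene weights W_s
    """
    num_scenes = micro_planes.shape[0]

    # One-hot vector e_s; contracting it with the stored planes retrieves
    # T_s^mic (equivalent to micro_planes[scene_idx]).
    e_s = np.zeros(num_scenes)
    e_s[scene_idx] = 1.0
    t_mic = np.tensordot(e_s, micro_planes, axes=1)                # (3, F_mic, H, W)

    # T_s^mac: sum of the M shared Tri-Planes, weighted by W_s.
    t_mac = np.tensordot(weights[scene_idx], macro_bases, axes=1)  # (3, F_mac, H, W)

    # Assumption: concatenation happens along the feature axis.
    return np.concatenate([t_mic, t_mac], axis=1)

# Toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
N, M, F_mic, F_mac, H, W = 4, 5, 2, 6, 8, 8
micro = rng.normal(size=(N, 3, F_mic, H, W))
macro = rng.normal(size=(M, 3, F_mac, H, W))
w = rng.normal(size=(N, M))

t_s = build_triplane(1, micro, macro, w)
print(t_s.shape)  # (3, 8, 8, 8)
```

In this toy layout, each additional scene stores only its `F_mic` local features plus `M` scalar weights, while the `M` shared Tri-Planes are stored once for all scenes. That is the intuition behind sharing common information across scenes; the sizes here are illustrative and are not meant to reproduce the reported 44% figure.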