Instant-3D: Instant Neural Radiance Field Training Towards On-Device AR/VR 3D Reconstruction

Sixu Li, Chaojian Li, Wenbo Zhu, Boyang Yu, Yang Zhao, Cheng Wan, Haoran You, Huihong Shi, Yingyan Celine Lin

TL;DR

This work tackles the challenge of instant on-device NeRF training for AR/VR by identifying embedding-grid interpolation as the main bottleneck and proposing an algorithm-hardware co-design called Instant-3D. The core idea is to decompose the 3D embedding grid into color and density branches with distinct grid sizes and update frequencies, and to implement a specialized accelerator featuring a feed-forward read mapper, a back-propagation update merger, and a reconfigurable multi-core grid to reduce memory accesses and adapt to different grid configurations. Empirical results show training-time reductions of 41x to 248x while maintaining reconstruction quality, with on-device scene reconstruction in about 1.6 seconds within a 1.9 W power envelope. The work demonstrates a path to practical instant NeRF-based AR/VR on edge devices through tight integration of algorithm design and hardware architecture.
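To make the interpolation bottleneck concrete, here is a minimal sketch of trilinear interpolation from a 3D embedding grid. It assumes a single dense grid stored as nested Python lists and indexed in grid coordinates; the name `trilinear_interpolate` and this layout are illustrative assumptions, not the paper's implementation (Instant-NGP-style methods use multi-resolution hashed grids). The sketch shows that every queried point reads eight vertex embeddings, which is why the step is dominated by memory accesses.

```python
import math


def trilinear_interpolate(grid, x, y, z):
    """Interpolate an embedding at a continuous point (x, y, z).

    Assumption: grid[i][j][k] is a list of floats (the embedding stored at
    vertex (i, j, k)) and the point is given in grid-index coordinates.
    """
    nx, ny, nz = len(grid), len(grid[0]), len(grid[0][0])

    def cell_and_fraction(coord, size):
        # Clamp the base index so that base + 1 is still a valid vertex.
        base = min(max(int(math.floor(coord)), 0), size - 2)
        frac = min(max(coord - base, 0.0), 1.0)
        return base, frac

    i0, fx = cell_and_fraction(x, nx)
    j0, fy = cell_and_fraction(y, ny)
    k0, fz = cell_and_fraction(z, nz)

    dim = len(grid[0][0][0])
    out = [0.0] * dim
    # Eight memory reads per query: one per corner of the enclosing cell.
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                weight = (
                    (fx if di else 1.0 - fx)
                    * (fy if dj else 1.0 - fy)
                    * (fz if dk else 1.0 - fz)
                )
                embedding = grid[i0 + di][j0 + dj][k0 + dk]
                for c in range(dim):
                    out[c] += weight * embedding[c]
    return out
```

Because nearby points often fall in the same cell, they read the same eight vertices. A read mapper can exploit that overlap by serving several points from one memory request.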

Abstract

Neural Radiance Field (NeRF) based 3D reconstruction is highly desirable for immersive Augmented and Virtual Reality (AR/VR) applications, but achieving instant (i.e., < 5 seconds) on-device NeRF training remains a challenge. In this work, we first identify the inefficiency bottleneck: the need to interpolate NeRF embeddings up to 200,000 times from a 3D embedding grid during each training iteration. To alleviate this, we propose Instant-3D, an algorithm-hardware co-design acceleration framework that achieves instant on-device NeRF training. Our algorithm decomposes the embedding grid representation in terms of color and density, enabling computational redundancy to be squeezed out by adopting different (1) grid sizes and (2) update frequencies for the color and density branches. Our hardware accelerator further reduces the dominant memory accesses for embedding grid interpolation by (1) mapping multiple nearby points' memory read requests into one during the feed-forward process, (2) merging embedding grid updates from the same sliding time window during back-propagation, and (3) fusing different computation cores to support the different grid sizes needed by the color and density branches of the Instant-3D algorithm. Extensive experiments validate the effectiveness of Instant-3D, achieving a large training time reduction of 41x to 248x while maintaining the same reconstruction quality. Excitingly, Instant-3D has enabled instant 3D reconstruction for AR/VR, requiring a reconstruction time of only 1.6 seconds per scene and meeting the AR/VR power consumption constraint of 1.9 W.
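The idea of merging embedding grid updates that fall in the same sliding window can be illustrated with a small software analogy. This sketch is not the accelerator's actual design: the function `merge_updates`, the `window` parameter, and the oldest-first eviction policy are all hypothetical choices made for illustration. It buffers pending gradient updates by grid address, accumulates repeated hits to the same address, and issues a single write when an entry leaves the window.

```python
from collections import OrderedDict


def merge_updates(updates, window):
    """Merge gradient updates that hit the same grid address.

    updates: iterable of (address, gradient) pairs in arrival order.
    window:  maximum number of distinct addresses buffered at once
             (hypothetical parameter, not taken from the paper).
    Returns the list of (address, accumulated_gradient) writes issued.
    """
    pending = OrderedDict()  # address -> accumulated gradient, oldest first
    writes = []
    for address, gradient in updates:
        if address in pending:
            # Same address seen again inside the window: accumulate, no new write.
            pending[address] += gradient
        else:
            pending[address] = gradient
            if len(pending) > window:
                # The oldest entry slides out of the window and is written back.
                writes.append(pending.popitem(last=False))
    # Flush whatever is still buffered at the end of the stream.
    writes.extend(pending.items())
    return writes
```

For example, the five updates `[(5, 1.0), (7, 0.5), (5, 2.0), (9, 1.0), (7, 0.5)]` with `window=2` collapse into three writes, and the total gradient applied to each address is unchanged.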

Paper Structure (21 sections, 3 equations, 18 figures, 5 tables)

Figures (18)

  • Figure 1: An illustration of NeRF-based 3D reconstruction, which takes 2D images from a set of sparsely sampled views of a 3D scene as its inputs and then generates images of the same scene from any desired new view.
  • Figure 2: The training process of NeRF [mildenhall2020nerf] involves a total of six steps: Step ❶ randomly samples pixels as a batch, Step ❷ maps the sampled pixels to rays $\mathbf{r} = \mathbf{o}+t\mathbf{d}$ by emitting rays to pass through the corresponding pixels, Step ❸ queries the features (i.e., the RGB color and the density $\sigma$) of points along the rays by providing their locations and directions as the inputs to an MLP model, Step ❹ predicts the pixels' colors following the principle of classical volume rendering [max1995optical], Step ❺ computes the loss as the squared error between the predicted colors and ground truth colors, and Step ❻ back-propagates through the above fully differentiable pipeline.
  • Figure 3: Instant-NGP [muller2022instant] achieves SOTA training efficiency by replacing Step ❸ (i.e., querying the features of points along the rays using a large 10-layer MLP model) in vanilla NeRFs [mildenhall2020nerf] with both Step ❸-① (interpolating embeddings from the embedding grid) and Step ❸-② (computing the features of the queried points using a small MLP model).
  • Figure 4: Training runtime breakdown averaged on the eight scenes of NeRF-Synthetic [mildenhall2020nerf] on three representative commercial devices, suggesting that the most efficient NeRF training algorithm [muller2022instant] is bottlenecked by Step ❸-① (i.e., interpolating embeddings from the embedding grid) and its corresponding back-propagation process on all considered scenes and devices.
  • Figure 5: (a) Color and density feature visualization during training: colors are learned faster than densities. On the Ficus scene [mildenhall2020nerf], the color features are of higher quality than the density features after the same number of training iterations (i.e., at the 160th iteration); the ground truth color and density features are shown as a reference. (b) Quantified PSNR of the color and density features over the whole training trajectory: the PSNR of the color features is consistently higher than that of the density features. The plot shows the average PSNR of RGB and depth images on the eight scenes [mildenhall2020nerf] vs. the number of training iterations.
  • ...and 13 more figures
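
Steps ❹ and ❺ in the Figure 2 caption (volume rendering and the squared-error loss) can be sketched for a single ray as follows. This is a minimal illustration using the standard discrete volume-rendering quadrature; the function names `render_ray` and `squared_error` and the list-based inputs are assumptions made here, not code from the paper.

```python
import math


def render_ray(densities, colors, deltas):
    """Composite samples along one ray into a pixel color (Step 4).

    densities: density sigma_i >= 0 at each sample along the ray.
    colors:    (r, g, b) tuple at each sample.
    deltas:    distance between adjacent samples.
    """
    transmittance = 1.0  # fraction of light not yet absorbed along the ray
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        for c in range(3):
            pixel[c] += weight * rgb[c]
        transmittance *= 1.0 - alpha
    return pixel


def squared_error(predicted, target):
    """Squared error between predicted and ground-truth colors (Step 5)."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target))
```

Every operation here is differentiable with respect to the densities and colors, which is what lets Step ❻ back-propagate the loss to the embedding grid and the MLP.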