Neural 3D Video Synthesis from Multi-view Video

Tianye Li, Mira Slavcheva, Michael Zollhoefer, Simon Green, Christoph Lassner, Changil Kim, Tanner Schmidt, Steven Lovegrove, Michael Goesele, Richard Newcombe, Zhaoyang Lv

TL;DR

This work introduces DyNeRF, a dynamic neural radiance field that represents multi-view dynamic scenes using time-conditioned latent embeddings, enabling continuous space-time rendering with a compact representation. It combines hierarchical training with ray importance sampling to speed up training and improve visual quality, rendering near-photorealistic novel views at over 1K resolution; a 10-second, 30 FPS recording from 18 cameras fits in a model of only 28 MB. The method outperforms static NeRF extensions and prior dynamic approaches on quantitative and perceptual metrics, while enabling motion interpolation and sub-frame temporal control. The authors provide datasets and discuss limitations, outlining directions for future work on more challenging motions and camera setups.

Abstract

We propose a novel approach for 3D video synthesis that is able to represent multi-view video recordings of a dynamic real-world scene in a compact, yet expressive representation that enables high-quality view synthesis and motion interpolation. Our approach takes the high quality and compactness of static neural radiance fields in a new direction: to a model-free, dynamic setting. At the core of our approach is a novel time-conditioned neural radiance field that represents scene dynamics using a set of compact latent codes. We are able to significantly boost the training speed and perceptual quality of the generated imagery by a novel hierarchical training scheme in combination with ray importance sampling. Our learned representation is highly compact and able to represent a 10 second 30 FPS multiview video recording by 18 cameras with a model size of only 28MB. We demonstrate that our method can render high-fidelity wide-angle novel views at over 1K resolution, even for complex and dynamic scenes. We perform an extensive qualitative and quantitative evaluation that shows that our approach outperforms the state of the art. Project website: https://neural-3d-video.github.io/.
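
To make the idea of a time-conditioned radiance field concrete, the following is a minimal sketch, not the authors' implementation. It assumes one learnable latent code per frame and assumes that a sub-frame time is handled by linearly interpolating the two neighbouring codes. The names `make_latent_codes`, `latent_at`, `network_input`, and the value of `LATENT_DIM` are hypothetical, and positional encoding and the network itself are omitted.

```python
import random

NUM_FRAMES = 300   # 10 s at 30 FPS, as stated in the abstract
LATENT_DIM = 8     # hypothetical code size, kept small for illustration


def make_latent_codes(num_frames, dim, seed=0):
    """Create one randomly initialised latent code per frame (trainable in practice)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_frames)]


def latent_at(codes, t):
    """Return the latent code at continuous frame index t.

    Assumption: codes of neighbouring frames are blended linearly.
    """
    if not 0.0 <= t <= len(codes) - 1:
        raise ValueError("t must lie inside the recorded frame range")
    lo = int(t)
    hi = min(lo + 1, len(codes) - 1)
    w = t - lo
    return [(1.0 - w) * a + w * b for a, b in zip(codes[lo], codes[hi])]


def network_input(position, view_dir, codes, t):
    """Concatenate position, view direction and time code into one input vector."""
    return list(position) + list(view_dir) + latent_at(codes, t)


codes = make_latent_codes(NUM_FRAMES, LATENT_DIM)
features = network_input((0.1, 0.2, 0.3), (0.0, 0.0, 1.0), codes, t=5.5)
```

Here `features` has 3 + 3 + `LATENT_DIM` entries; a real model would feed such a vector (after encoding) to an MLP that predicts colour and density.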

Paper Structure

This paper contains 28 sections, 7 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: We propose a novel method for representing and rendering high-quality 3D video. Our method trains a novel and compact dynamic neural radiance field (DyNeRF) in an efficient way. Our method demonstrates near-photorealistic dynamic novel view synthesis for complex scenes, including challenging scene motions and strong view-dependent effects. We demonstrate three synthesized 3D videos and show the associated high-quality geometry in the heatmap visualization in each top right corner. The embedded animations only play in Adobe Reader or KDE Okular. Please see the video at https://neural-3d-video.github.io/resources/video.mp4 for the high-quality renderings and additional information.
  • Figure 2: We learn the 6D plenoptic function by our novel dynamic neural radiance field (DyNeRF) that conditions on position, view direction and a compact, yet expressive time-variant latent code.
  • Figure 3: Overview of our efficient training strategies. We perform hierarchical training, first using keyframes (b) and then the full sequence (c). At both stages, we apply the ray importance sampling technique to focus on the rays with high time-variant information, based on weight maps that measure the temporal appearance changes (a). We show a visualized example of the sampling probability based on the global median map using a heatmap (red and opaque means high probability).
  • Figure 4: High-quality novel view videos synthesized by our approach for dynamic real-world scenes. We visualize normalized depth in color space in the last column of each row. Our representation is compact, yet expressive, and even handles complex specular reflections and translucency.
  • Figure 5: Comparison of our final model to existing methods, including Multi-view Stereo (MVS), local light field fusion (LLFF) [mildenhall2019local] and NeuralVolume (NV) [Lombardi19tog]. The first row shows novel view rendering on a test view. The second row visualizes the FLIP error compared to the ground truth image. Compared to alternative methods, our method achieves the best visual quality.
  • ...and 7 more figures
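
The ray importance sampling described in the Figure 3 caption can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's exact formulation: it assumes a single camera with scalar pixel intensities, uses the plain absolute difference from the per-pixel temporal median as the weight, and adds a small floor so static pixels are still sampled occasionally. The names `median_weight_maps`, `sample_rays`, and the `floor` parameter are hypothetical.

```python
import random
from statistics import median


def median_weight_maps(video, floor=1e-3):
    """Compute per-frame weight maps from a global (temporal) median map.

    video[t][p] is the intensity of pixel p in frame t for one camera.
    Assumption: weight = |pixel - temporal median| + floor.
    """
    num_pixels = len(video[0])
    med = [median(frame[p] for frame in video) for p in range(num_pixels)]
    return [
        [abs(frame[p] - med[p]) + floor for p in range(num_pixels)]
        for frame in video
    ]


def sample_rays(weights, num_rays, seed=0):
    """Draw (frame, pixel) pairs with probability proportional to their weight."""
    rng = random.Random(seed)
    rays = [(t, p) for t in range(len(weights)) for p in range(len(weights[t]))]
    flat = [weights[t][p] for t, p in rays]
    return rng.choices(rays, weights=flat, k=num_rays)


# Toy video: 4 frames, 3 pixels; only pixel 1 changes, and only in frame 2.
video = [
    [0.5, 0.1, 0.9],
    [0.5, 0.1, 0.9],
    [0.5, 0.8, 0.9],
    [0.5, 0.1, 0.9],
]
weights = median_weight_maps(video)
batch = sample_rays(weights, num_rays=1000)
```

In this toy example almost all sampled rays land on the one pixel that changes over time, which is the intended effect: training effort concentrates on time-variant regions rather than on the static background.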