Quark: Real-time, High-resolution, and General Neural View Synthesis

John Flynn, Michael Broxton, Lukas Murmann, Lucy Chai, Matthew DuVall, Clément Godard, Kathryn Heal, Srinivas Kaza, Stephen Lombardi, Xuan Luo, Supreeth Achar, Kira Prabhu, Tiancheng Sun, Lynn Tsai, Ryan Overbeck

TL;DR

A novel neural algorithm for high-quality, high-resolution, real-time novel view synthesis: from a sparse set of input RGB images or video streams, it reconstructs the 3D scene and renders novel views at 1080p resolution at 30fps on an NVIDIA A100.

Abstract

We present a novel neural algorithm for performing high-quality, high-resolution, real-time novel view synthesis. From a sparse set of input RGB images or video streams, our network both reconstructs the 3D scene and renders novel views at 1080p resolution at 30fps on an NVIDIA A100. Our feed-forward network generalizes across a wide variety of datasets and scenes and produces state-of-the-art quality for a real-time method. Our quality approaches, and in some cases surpasses, the quality of some of the top offline methods. In order to achieve these results we use a novel combination of several key concepts, and tie them together into a cohesive and effective algorithm. We build on previous works that represent the scene using semi-transparent layers and use an iterative learned render-and-refine approach to improve those layers. Instead of flat layers, our method reconstructs layered depth maps (LDMs) that efficiently represent scenes with complex depth and occlusions. The iterative update steps are embedded in a multi-scale, UNet-style architecture to perform as much compute as possible at reduced resolution. Within each update step, to better aggregate the information from multiple input views, we use a specialized Transformer-based network component. This allows the majority of the per-input image processing to be performed in the input image space, as opposed to layer space, further increasing efficiency. Finally, due to the real-time nature of our reconstruction and rendering, we dynamically create and discard the internal 3D geometry for each frame, generating the LDM for each view. Taken together, this produces a novel and effective algorithm for view synthesis. Through extensive evaluation, we demonstrate that we achieve state-of-the-art quality at real-time rates. Project page: https://quark-3d.github.io/
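The abstract builds on scene representations made of semi-transparent layers. As a rough illustration of how such layers can be turned into a pixel color, the sketch below applies the standard front-to-back "over" compositing rule along a single ray. It is a generic example and not the paper's renderer: the function name `composite_front_to_back` and the per-layer `colors` and `alphas` inputs are hypothetical, and the sketch assumes the layers have already been sorted from nearest to farthest and sampled at the ray.

```python
def composite_front_to_back(colors, alphas):
    """Blend per-layer samples along one ray, nearest layer first.

    colors: list of (r, g, b) tuples, one per layer (hypothetical input).
    alphas: list of opacities in [0, 1], one per layer (hypothetical input).
    Returns the blended color and the transmittance left behind all layers.
    """
    out = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet blocked by nearer layers
    for color, alpha in zip(colors, alphas):
        weight = transmittance * alpha  # contribution of this layer
        out = [o + weight * c for o, c in zip(out, color)]
        transmittance *= 1.0 - alpha
    return out, transmittance


# A half-transparent red layer in front of an opaque green layer.
color, remaining = composite_front_to_back(
    colors=[(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    alphas=[0.5, 1.0],
)
print(color, remaining)  # [0.5, 0.5, 0.0] 0.0
```

In a layered depth map, each layer would additionally carry a per-pixel depth, so the sample positions along the ray vary per pixel instead of lying on flat planes; the blending rule itself is unchanged in this sketch.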

Paper Structure

This paper contains 34 sections, 15 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Encode Input Images from Fig. \ref{fig:teaser}. Quark encodes and downsamples input images using a series of residual networks and strided mean-pooling layers.
  • Figure 2: Update & Fuse Step from Fig. \ref{fig:teaser}. During each iteration, the Update & Fuse step uses a render-and-refine approach to generate a refined feature volume. (a) First, the feature volume is decoded into an LDM and rendered $M$ times into each of the input viewpoints (see bottom inset). (b) Next, the rendered features are combined with input features $\mathbf{I}_k$ and encoded ray directions $\gamma_k$ via a residual Feed-forward CNN to generate update features from each view. During iterations where the feature volume is upscaled, the rendered intermediate LDM is upsampled by a factor of two in the spatial dimensions and combined with image features at the next level of detail. (c) Updated features are Backprojected into the feature volume using the same depths $d$ decoded in (a). (d) Finally, updates from all views are combined into a single set of update features $\Delta$ and fed into the Fusion block, which uses across-view attention (top inset) to reason about visibility and update the feature volume. Note that the Fusion block is repeated a variable number of times during each iteration, as is the residual CNN within it. Layer collapse, which reduces the number of layers by a factor of 2 via a residual CNN, is also applied during the final two iterations. See Tabs. \ref{table:supp_quark} and \ref{table:supp_quarkplus} for these and other per-iteration implementation details.
  • Figure 3: Comparison of Quark to current methods for generalizable neural view synthesis (for numerical results, see Table \ref{table:comparison}). Quark preserves image details and thin structures without blurring or image doubling.
  • Figure 4: Comparisons of Quark to methods included in the DL3DV-10K NVS Benchmark (for numerical results, see Table \ref{table:dl3dv}). Quark preserves crisp image detail and thin structures while faithfully rendering view-dependent effects like specular highlights and reflective surfaces.
  • Figure 5: Comparisons of methods on scenes from the MipNeRF-360 and Tanks & Temples datasets (for numerical results, see Table \ref{tab:comparisons3dgs}). Quark closely matches the resolution of the target image and preserves thin structures as well as or better than other methods.
  • ...and 3 more figures
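
The Figure 1 caption mentions strided mean-pooling layers for downsampling the input images. A minimal sketch of one such step is shown below for a single-channel image stored as nested lists. The 2×2 window, the stride of 2, and the function name `mean_pool_2x` are assumptions made for illustration; the caption does not state the window size or stride used in Quark, and the residual networks around the pooling are omitted.

```python
def mean_pool_2x(image):
    """Downsample a single-channel image by averaging 2x2 blocks (stride 2).

    image: list of rows, each a list of floats (hypothetical layout).
    A trailing odd row or column is dropped in this sketch.
    """
    height, width = len(image), len(image[0])
    return [
        [
            (image[y][x] + image[y][x + 1]
             + image[y + 1][x] + image[y + 1][x + 1]) / 4.0
            for x in range(0, width - 1, 2)
        ]
        for y in range(0, height - 1, 2)
    ]


image = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
    [9.0, 10.0, 11.0, 12.0],
    [13.0, 14.0, 15.0, 16.0],
]
print(mean_pool_2x(image))  # [[3.5, 5.5], [11.5, 13.5]]
```

Applying the function repeatedly yields a pyramid of progressively coarser images, which is one way to read the multi-scale processing described in the captions above.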