Global Latent Neural Rendering

Thomas Tanay; Matteo Maggioni

TL;DR

This work proposes the Convolutional Global Latent Renderer (ConvGLR), an efficient convolutional architecture that performs the rendering operation globally in a low-resolution latent space and consistently outperforms existing methods by significant margins.

Abstract

A recent trend among generalizable novel view synthesis methods is to learn a rendering operator acting over single camera rays. This approach is promising because it removes the need for explicit volumetric rendering, but it effectively treats target images as collections of independent pixels. Here, we propose to learn a global rendering operator acting over all camera rays jointly. We show that the right representation to enable such rendering is a 5-dimensional plane sweep volume consisting of the projection of the input images on a set of planes facing the target camera. Based on this understanding, we introduce our Convolutional Global Latent Renderer (ConvGLR), an efficient convolutional architecture that performs the rendering operation globally in a low-resolution latent space. Experiments on various datasets under sparse and generalizable setups show that our approach consistently outperforms existing methods by significant margins.
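The plane sweep volume described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes pinhole cameras with a world-to-camera convention `x_cam = R @ x_world + t`, nearest-neighbour sampling, and zero padding outside the image; the function name `build_psv` and all argument names are hypothetical. The output has five dimensions: views, depths, height, width, and channels.

```python
import numpy as np

def build_psv(images, K_src, R_src, t_src, K_tgt, R_tgt, t_tgt, depths):
    """Project each input image onto planes facing the target camera.

    images: (V, H, W, C); K_src, R_src: (V, 3, 3); t_src: (V, 3)
    K_tgt, R_tgt: (3, 3); t_tgt: (3,); depths: sequence of D plane depths.
    Returns an array of shape (V, D, H, W, C).
    """
    V, H, W, C = images.shape
    D = len(depths)
    # Homogeneous pixel grid of the target view, shape (3, H*W).
    v, u = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)], axis=0).astype(np.float64)
    # Target camera rays, normalised so that z = 1.
    rays = np.linalg.inv(K_tgt) @ pix
    psv = np.zeros((V, D, H, W, C), dtype=images.dtype)
    for j, d in enumerate(depths):
        # 3D points on the plane at depth d, expressed in world coordinates.
        pts_world = R_tgt.T @ (rays * d - t_tgt[:, None])
        for i in range(V):
            # Project the plane points into input view i.
            proj = K_src[i] @ (R_src[i] @ pts_world + t_src[i][:, None])
            z = proj[2]
            valid = z > 1e-8  # keep only points in front of the input camera
            x = np.full(H * W, -1.0)
            y = np.full(H * W, -1.0)
            x[valid] = proj[0, valid] / z[valid]
            y[valid] = proj[1, valid] / z[valid]
            xi = np.rint(x).astype(int)
            yi = np.rint(y).astype(int)
            inside = valid & (xi >= 0) & (xi < W) & (yi >= 0) & (yi < H)
            # Nearest-neighbour lookup; samples outside the image stay zero.
            plane = np.zeros((H * W, C), dtype=images.dtype)
            plane[inside] = images[i, yi[inside], xi[inside]]
            psv[i, j] = plane.reshape(H, W, C)
    return psv
```

When an input camera coincides with the target camera, every depth slice for that view reproduces the input image, which is a convenient sanity check. A practical implementation would typically use bilinear sampling on the GPU instead of the nearest-neighbour lookup used here.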

Paper Structure (32 sections, 9 figures, 9 tables)

Introduction
Related work
NeRFs
Light fields
Ray transformers
Explicit geometry
Multiplane images
3D cost volumes
Background
Method
The Plane Sweep Volume
Global Latent Neural Rendering
PSV grouping
Multi-view matching
Global latent rendering
...and 17 more sections

Figures (9)

Figure 1: Qualitative comparison of our method with various baselines under 5 different experimental setups. Our method renders target views in a low-resolution latent space and operates over all camera rays jointly. It produces significantly better geometries and textures than previous sparse and generalizable methods, which render light rays independently and typically suffer from grainy artifacts.
Figure 2: Epipolar lines and the plane sweep volume. Left: The camera ray passing through the pixel location $(h,w)$ in the target view projects as a set of epipolar lines in the input views. Ray transformers [wang2021ibrnet, suhail2022generalizable, du2023cross, t2023is] process information sampled along these epipolar lines to predict the color of the target pixel $(h,w)$. Right: The camera rays passing through adjacent pixel locations in the target view project as corresponding sets of epipolar lines in the input views. Sampling along these sets of epipolar lines at constant depths defines a plane sweep volume facing the target view. Processing this plane sweep volume allows adjacent camera rays to be rendered jointly.
Figure 3: Overview of ConvGLR. The 4D grouped PSV $\bm{X}$ is turned into a latent volumetric representation $\bm{Y}$, then rendered into a latent novel view $\bm{Z}$ and finally upsampled into the novel view $\bm{\tilde{I}}_\ast$. All the dark gray blocks are implemented with 2D convolutions and resblocks.
Figure 4: The epipolar geometry of the plane sweep volume. 1. The PSV is constructed by projecting each input view on a set of planes distributed parallel to the target image plane. 2. The camera ray passing through the pixel location (h, w) in the target image plane (gray line in 1.) projects as a set of epipolar lines in the input views (white lines in 3.). 4. Moving along the depth dimension of the PSV at pixel location (h, w) is equivalent to moving along the corresponding epipolar lines for each input view. The actual depth of the object at pixel location (h, w) is found when the local image features match across views (yellow dot).
Figure 5: Averaging the plane sweep volume. 1. The target view for which a plane sweep volume is constructed, using 9 input views (not including the target view) and near and far bounds that are close to the object depth. 2. Averaging the PSV over views and depths provides a blurry estimate of the target view. 3. Averaging the PSV over the views brings successive depths of the object into focus.
...and 4 more figures
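
The two averages described in Figure 5 could be computed as in the sketch below. It assumes the plane sweep volume is stored as an array with layout (views, depths, height, width, channels); that layout and the function name `average_psv` are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def average_psv(psv):
    """psv: array of shape (V, D, H, W, C)."""
    # Mean over views only: one image per depth plane, shape (D, H, W, C).
    # Regions whose true depth matches the plane line up across views.
    per_depth = psv.mean(axis=0)
    # Mean over views and depths: a single blurry image, shape (H, W, C).
    overall = psv.mean(axis=(0, 1))
    return per_depth, overall
```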
