Real-time High-resolution View Synthesis of Complex Scenes with Explicit 3D Visibility Reasoning

Tiansong Zhou; Yebin Liu; Xuangeng Chu; Chengkun Cao; Changyin Zhou; Fei Yu; Yu Li

TL;DR

Experimental results show that the proposed generalizable view synthesis method outperforms previous view synthesis methods in both rendering quality and speed, particularly when dealing with complex dynamic scenes with sparse views.

Abstract

Rendering photo-realistic novel-view images of complex scenes has been a long-standing challenge in computer graphics. In recent years, great research progress has been made on enhancing rendering quality and accelerating rendering speed in the realm of view synthesis. However, when rendering complex dynamic scenes with sparse views, the rendering quality remains limited due to occlusion problems. Besides, for rendering high-resolution images of dynamic scenes, the rendering speed is still far from real-time. In this work, we propose a generalizable view synthesis method that can render high-resolution novel-view images of complex static and dynamic scenes in real-time from sparse views. To address the occlusion problems arising from the sparsity of input views and the complexity of captured scenes, we introduce an explicit 3D visibility reasoning approach that can efficiently estimate the visibility of sampled 3D points to the input views. The proposed visibility reasoning approach is fully differentiable and can gracefully fit inside the volume rendering pipeline, allowing us to train our networks with only multi-view images as supervision while refining geometry and texture simultaneously. Besides, each module in our pipeline is carefully designed to bypass the time-consuming MLP querying process and enhance the rendering quality of high-resolution images, enabling us to render high-resolution novel-view images in real-time. Experimental results show that our method outperforms previous view synthesis methods in both rendering quality and speed, particularly when dealing with complex dynamic scenes with sparse views.

Paper Structure (32 sections, 20 equations, 7 figures, 5 tables)

Introduction
Related Work
NeRF Works
NeRF Acceleration
Visibility Reasoning
Method
Feature Encoding
Discretized Geometry Volumes
Feature volume construction
Density volume regression
Continuous Texture Volumes
Ray hierarchical sampling
Explicit 3D visibility reasoning
Ray integration
Rendering
...and 17 more sections

Figures (7)

Figure 1: Our method achieves real-time rendering of high-resolution dynamic scenes with high visual quality, enabling users to seamlessly transition to desired perspectives at any time. At the heart of our method is explicit 3D visibility reasoning, which efficiently estimates the visibility of the sampled 3D points to the input views, helping us to address the occlusion problems arising from the sparsity of input views and the complexity of captured scenes.
Figure 2: The overall pipeline of our system. Our pipeline firstly uses an encoder-net $\mathcal{E}$ to extract geometry and texture feature maps from the input images. In the geometry volumes branch, we construct a 3D feature volume on the novel view's camera frustum based on the extracted geometry feature maps. Then, we use a 3D CNN $\mathcal{O}$ to regress a density volume of the novel view. In the texture volumes branch, we hierarchically sample 3D points in each marching ray of the novel view. Then, we project each sampled 3D point to input views and grab features on the texture feature maps. To aggregate the grabbed multi-view features, we use explicit 3D visibility reasoning to get the weight of each view. After aggregating multi-view features and performing ray integration for all the rays, we get a low-resolution feature map $F_{novel}^{lr}$. Apart from the feature map, we also interpolate a low-resolution color image $I_{novel}^{inter}$ as an additional supervision signal to make our pipeline more geometrically interpretable. Finally, we use a render-net $\mathcal{R}$ to render the high-resolution color image $I_{novel}^{hr}$ from the low-resolution feature map $F_{novel}^{lr}$, in which we procedurally up-sample the $F_{novel}^{lr}$ to high resolution in the render-net $\mathcal{R}$. Our method is fully differentiable and can be trained with only sparse multi-view images as supervision.
Figure 3: Our explicit 3D visibility reasoning. Based on the regressed density volume on the novel view (the first column), we build a volume in each input view's camera frustum and use the constructed volume to re-sample the novel view's density volume (the second column), getting the density volumes of input views (the third column). Then, based on the re-sampled density volumes, we calculate the visibility volumes (the fourth column) using Equation \ref{eq:alpha} and Equation \ref{eq:visibility}. Finally, for each hierarchically sampled ray in the novel view (the fifth column), we transform the sampled 3D points to input views' visibility volumes and get visibility weights by tri-linearly interpolating the visibility volumes (the sixth column).
Figure 4: Visual comparisons on static scenes. "ft-15min" means fine-tuning the generalization models for $\sim$15 minutes. "sc-15min" means training networks from scratch for $\sim$15 minutes. Our method generally shows competitive rendering results with the baselines. On the occluded areas of the scene in the LLFF dataset, our method achieves better results compared with previous generalizable methods (ENeRF, Neuray), demonstrating the efficiency of the proposed explicit 3D visibility reasoning. (Best viewed with zooming in on the page.)
Figure 5: Fine-tuning on the "Leaves" scene. As the render-net $\mathcal{R}$ is a texture prior of the training data and the "Leaves" scene in the LLFF dataset has very different texture from the training scenes in the DTU dataset, the rendering quality of the generalization model decreases. We show that the rendering quality can be significantly improved after fine-tuning for only 1500 iterations ($\sim$15 min).
...and 2 more figures
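
The Figure 3 caption describes turning a re-sampled density volume into a visibility volume and then using the visibility values as per-view weights. The equations it refers to are not reproduced on this page, so the sketch below assumes the standard volume-rendering form: opacity $\alpha_i = 1 - \exp(-\sigma_i \delta)$ and visibility as the accumulated transmittance $\prod_{j<i}(1 - \alpha_j)$ along a ray of an input view. The function names `visibility_along_ray` and `aggregate_views`, the uniform step size, and the normalized weighted average are all illustrative assumptions, not the authors' implementation.

```python
import math


def visibility_along_ray(densities, step):
    """Return (alphas, visibilities) for samples ordered from the camera outward.

    Assumed form (standard volume rendering, may differ from the paper):
      alpha_i = 1 - exp(-sigma_i * step)
      vis_i   = prod_{j < i} (1 - alpha_j)
    """
    # Clamp negative densities to zero so alpha stays in [0, 1].
    alphas = [1.0 - math.exp(-max(s, 0.0) * step) for s in densities]
    visibilities = []
    transmittance = 1.0
    for a in alphas:
        # Visibility of a sample is the transmittance accumulated *before* it.
        visibilities.append(transmittance)
        transmittance *= 1.0 - a
    return alphas, visibilities


def aggregate_views(features, vis_weights, eps=1e-8):
    """Visibility-weighted average of per-view feature vectors (hypothetical)."""
    total = sum(vis_weights) + eps
    dim = len(features[0])
    return [
        sum(w * f[k] for f, w in zip(features, vis_weights)) / total
        for k in range(dim)
    ]
```

In this sketch a sample behind a dense surface receives visibility close to zero for that input view, so features fetched from that view contribute little to the aggregated feature. In the pipeline described by the caption, the visibility values would be stored per input view as a volume and read back by tri-linear interpolation rather than recomputed per sample.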
