LoLep: Single-View View Synthesis with Locally-Learned Planes and Self-Attention Occlusion Inference

Cong Wang; Yu-Ping Wang; Dinesh Manocha

TL;DR

LoLep regresses locally-learned planes from a single RGB image to represent scenes accurately and thus generate better novel views. It also introduces a self-attention mechanism to improve occlusion inference, together with a Block-Sampling Self-Attention (BS-SA) module that makes self-attention applicable to large feature maps.

Abstract

We propose a novel method, LoLep, which regresses Locally-Learned planes from a single RGB image to represent scenes accurately, thus generating better novel views. Without depth information, regressing appropriate plane locations is a challenging problem. To solve this issue, we pre-partition the disparity space into bins and design a disparity sampler to regress local offsets for multiple planes in each bin. However, the network does not converge with such a sampler alone; we therefore propose two optimization strategies tailored to the different disparity distributions of the datasets, and an occlusion-aware reprojection loss as a simple yet effective geometric supervision technique. We also introduce a self-attention mechanism to improve occlusion inference and present a Block-Sampling Self-Attention (BS-SA) module to address the problem of applying self-attention to large feature maps. We demonstrate the effectiveness of our approach and achieve state-of-the-art results on different datasets. Compared to MINE, our approach reduces LPIPS by 4.8%-9.0% and rendering variance (RV) by 73.9%-83.5%. We also evaluate the performance on real-world images and demonstrate the benefits.
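
The bin-plus-offset idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the uniform bin widths, the sigmoid used to keep each offset inside its bin, and the names `sample_plane_disparities`, `raw_offsets`, `d_min`, and `d_max` are all assumptions made for illustration. In the paper the offsets are regressed by a network; here they are plain numbers.

```python
import math


def sigmoid(x):
    """Squash an unconstrained value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def sample_plane_disparities(raw_offsets, d_min, d_max):
    """Place planes inside pre-partitioned disparity bins.

    raw_offsets: one list per bin, holding one unconstrained value per plane
                 (hypothetical stand-ins for network outputs).
    Returns plane disparities sorted from near (large) to far (small).
    """
    num_bins = len(raw_offsets)
    bin_width = (d_max - d_min) / num_bins  # assumption: uniform bins
    disparities = []
    for b, offsets in enumerate(raw_offsets):
        bin_start = d_min + b * bin_width
        for o in offsets:
            # The local offset moves the plane within its own bin only.
            disparities.append(bin_start + sigmoid(o) * bin_width)
    return sorted(disparities, reverse=True)


# Two bins with two planes each over the disparity range [0, 1].
planes = sample_plane_disparities([[0.0, 2.0], [-1.0, 1.0]], d_min=0.0, d_max=1.0)
print(planes)
```

Because every plane is confined to its bin, the planes always cover the whole disparity range while still adapting their exact locations to the image.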

Paper Structure (18 sections, 9 equations, 6 figures, 5 tables, 1 algorithm)

Introduction
Related Works
Background
Our Method
Locally-Learned Planes
The Point for Locally-Learned Planes
Disparity Sampler
Occlusion-Aware Reprojection Loss
Self-Attention Occlusion Inference
Implementation and Results
Rendering Variance
View Synthesis on KITTI
View Synthesis on RealEstate10K
View Synthesis on Flowers Light Fields
Depth Evaluation on NYU-V2 and iBims-1
...and 3 more sections

Figures (6)

Figure 1: Comparisons on the KITTI dataset. LoLep generates state-of-the-art results. Even with fewer planes, LoLep uses less memory and generates better novel views than previous methods with more planes (LoLep-16 vs. MINE-32, MINE-64 and MPI-32; LoLep-32 vs. MINE-64), which benefits from locally-learned planes and self-attention occlusion inference. The batch size is 4.
Figure 2: Overview. LoLep regresses locally-learned planes to represent scenes accurately without a depth map input, relying mainly on three novel components. (a) Disparity Sampler: regresses accurate locations for multiple planes from only the RGB image; (b) Occlusion-aware Reprojection Loss: a simple yet effective geometric supervision technique for single-view view synthesis to learn better geometry; (c) Block-Sampling Self-Attention: allows self-attention to be applied to large feature maps for higher performance. '$\oplus$' concatenates two tensors.
Figure 3: Block-Sampling Self-Attention Module. The block-sampling self-attention module reduces the size of the attention matrix from $HW \times HW$ to $M \times HW$ and solves the issue that the original self-attention mechanism cannot be applied to large feature maps. $M$ is a hyper-parameter. "$\otimes$" denotes matrix multiplication. The softmax operation is performed on each row.
Figure 4: Qualitative comparison on the KITTI dataset. All images are from the test dataset and highlight the benefits of LoLep. (A) MINE synthesizes a broken pole. (B) MINE fails to infer occluded regions, thereby causing ghosting. (C) MINE regresses a suboptimal scene representation, thereby generating ghosting. (D) MINE synthesizes a twisted pole due to inconsistent depths of the pole.
Figure 5: Qualitative comparison on the RealEstate10K dataset. (A) MINE fails to infer the geometry of the balustrade in stairs. (B) MINE generates many artifacts and blurry regions. In contrast, LoLep generates improved results.
...and 1 more figure
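
The size reduction described in the Figure 3 caption can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the paper's module: it uses the raw features as queries, keys, and values (no learned projections), samples one random position per block as the key/value set, and stores the attention matrix as $HW \times M$, i.e. the transpose of the caption's $M \times HW$ layout with the same number of entries. The helper names `block_sample_indices` and `block_sampling_attention` are hypothetical.

```python
import math
import random


def block_sample_indices(h, w, block, rng):
    """Pick one random position inside each block x block tile (flat row-major indices)."""
    indices = []
    for top in range(0, h, block):
        for left in range(0, w, block):
            r = rng.randrange(top, min(top + block, h))
            c = rng.randrange(left, min(left + block, w))
            indices.append(r * w + c)
    return indices


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def block_sampling_attention(feat, h, w, block, rng):
    """feat: list of h*w feature vectors in row-major order.

    Returns (output, attention); attention has h*w rows and M columns,
    where M is the number of blocks.
    """
    channels = len(feat[0])
    scale = 1.0 / math.sqrt(channels)
    # Assumption: only the M sampled positions act as keys and values.
    keys = [feat[i] for i in block_sample_indices(h, w, block, rng)]
    attention = [
        softmax([scale * sum(q * k for q, k in zip(query, key)) for key in keys])
        for query in feat
    ]
    output = [
        [sum(a * key[ch] for a, key in zip(row, keys)) for ch in range(channels)]
        for row in attention
    ]
    return output, attention


rng = random.Random(0)
h, w, channels = 8, 12, 4
feat = [[rng.gauss(0.0, 1.0) for _ in range(channels)] for _ in range(h * w)]
output, attention = block_sampling_attention(feat, h, w, block=4, rng=rng)
print(len(attention), len(attention[0]))  # rows = H*W, columns = M
```

With an 8 x 12 map and 4 x 4 blocks, full self-attention would need a 96 x 96 matrix, while this sketch stores 96 x 6 entries, so memory grows linearly with $HW$ for a fixed $M$.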
