Bridging Implicit and Explicit Geometric Transformation for Single-Image View Synthesis

Byeongjun Park, Hyojun Go, Changick Kim

TL;DR

This work tackles the seesaw problem in single-image view synthesis, where preserving reprojected seen content and realistically filling unseen regions are at odds. It introduces a non-autoregressive framework that bridges explicit and implicit geometric transformations via two parallel renderers and a depth-informed GLSA encoder, unified by a transformation similarity loss. The method achieves state-of-the-art performance on RealEstate10K and ACID, with PSNR-vis improvements and the lowest FID, while delivering approximately $100\times$ faster inference than autoregressive rivals. By combining depth-driven 3D geometry, global-local point-cloud attention, and a balanced renderer collaboration, the approach enables real-time, high-fidelity novel-view synthesis and shows promise for extrapolation tasks beyond single-view setups.

Abstract

Novel view synthesis from a single image has made tremendous strides with advanced autoregressive models, since unseen regions have to be inferred from the visible scene contents. Although recent methods generate high-quality novel views, synthesizing with only one explicit or implicit 3D geometry involves a trade-off between two objectives that we call the "seesaw" problem: 1) preserving reprojected contents and 2) completing realistic out-of-view regions. In addition, autoregressive models incur a considerable computational cost. In this paper, we propose a single-image view synthesis framework that mitigates the seesaw problem while using an efficient non-autoregressive model. Motivated by the observation that explicit methods preserve reprojected pixels well and implicit methods complete realistic out-of-view regions, we introduce a loss function that lets the two renderers complement each other. Our loss function encourages explicit features to improve the reprojected area of implicit features, and implicit features to improve the out-of-view area of explicit features. With the proposed architecture and loss function, we can alleviate the seesaw problem, outperforming autoregressive-based state-of-the-art methods and generating an image $\approx$100 times faster. We validate the efficiency and effectiveness of our method with experiments on the RealEstate10K and ACID datasets.
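
The complementary loss described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine-distance form, the function names `cosine` and `transformation_similarity`, and the flat per-pixel feature lists are all assumptions made for illustration. The only thing taken from the text is the idea that the two feature maps are pulled together, with the in-view term meant to update the implicit features and the out-of-view term meant to update the explicit features.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors (small epsilon avoids /0)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def _mean(values):
    return sum(values) / len(values) if values else 0.0


def transformation_similarity(h_i, h_e, out_mask):
    """Return (loss_in, loss_out) as mean cosine distances (assumed form).

    h_i, h_e: per-pixel feature vectors from the implicit / explicit renderer.
    out_mask: 1 where the target pixel is out of view, 0 where it is reprojected.

    In an autograd framework, loss_in would treat h_e as a constant target so
    that only h_i is updated, and loss_out would treat h_i as a constant so
    that only h_e is updated. This plain-Python sketch computes values only.
    """
    in_terms, out_terms = [], []
    for f_i, f_e, is_out in zip(h_i, h_e, out_mask):
        distance = 1.0 - cosine(f_i, f_e)
        if is_out:
            out_terms.append(distance)
        else:
            in_terms.append(distance)
    return _mean(in_terms), _mean(out_terms)
```

For example, with `h_i = [[1, 0], [0, 1]]`, `h_e = [[1, 0], [1, 0]]` and `out_mask = [0, 1]`, the in-view term is close to 0 (the features agree) and the out-of-view term is close to 1 (the features are orthogonal).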

Paper Structure (47 sections, 17 equations, 21 figures, 12 tables)

This paper contains 47 sections, 17 equations, 21 figures, 12 tables.

Figures (21)

  • Figure 1: Seesaw problem of explicit and implicit methods. Explicit methods preserve warped contents well but struggle to fill unseen pixels ($\uparrow$ PSNR on small view change, $\uparrow$ FID on large view change). Implicit methods fill unseen pixels well but fall short of preserving seen contents ($\downarrow$ PSNR on small view change, $\downarrow$ FID on large view change). Note that LookOut [ren2022look] focuses on long-term novel view synthesis, so it balances the two objectives poorly. Our proposed framework alleviates this seesaw problem and generates an image faster than the state-of-the-art methods.
  • Figure 2: An overview of the network architecture. Our network takes a reference image $I_{ref}$ and a relative camera pose $T$ as inputs. The depth estimation network (DepthNet) first predicts a depth map $D$, and the view synthesis network (ViewNet) generates a target image $I_{tgt}$ from $I_{ref}$, $D$ and $T$. Specifically, $D$ is used to calculate the 3D world coordinate $X_{w}$ and the normalized image coordinate $X_{img}$ at the reference viewpoint, which are passed through several positional encoding layers in the encoder (e.g., $\delta_{global}$, $\delta_{local}^{abs}$ and $\delta_{local}^{rel}$) to provide scene structure representations. The encoded features $f_{N}$ are then transformed by both the Implicit Renderer and the Explicit Renderer with $T$. Finally, the two transformed feature maps, $h_{i}$ and $h_{e}$, are concatenated and passed to the decoder to generate $I_{tgt}$.
  • Figure 3: Illustration of the Local Set Attention Block. (a) A relative position in 3D world coordinates (a red dotted line) is decomposed into three relative positions (blue dotted lines). (b) The decomposed relative positions are fed to the corresponding positional encoding layers to output the local set attention $g_{local}^i(p)$.
  • Figure 4: An overview of our transformation similarity loss. The two transformed features, $h_{i}$ and $h_{e}$, complement each other through the transformation similarity loss. Specifically, we first derive the out-of-view mask $\textbf{O}$ from $K$, $D$ and $T$. Using $\textbf{O}$, two transformation similarity losses, i.e., $L_{ts, in}$ and $L_{ts, out}$, are applied to encourage the discriminability of $h_{i}$ and $h_{e}$, respectively. To guide the other renderer as intended, we allow the back-propagated gradients of $L_{ts, in}$ to reach only the reprojected regions of $h_{i}$, and those of $L_{ts, out}$ to reach only the out-of-view regions of $h_{e}$.
  • Figure 5: Quantitative comparisons on the averaged evaluation metrics over three splits.
  • ...and 16 more figures
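
The Figure 4 caption states that the out-of-view mask $\textbf{O}$ is derived from $K$, $D$ and $T$ but does not spell out the procedure. One plausible reading, sketched below under stated assumptions, is a forward reprojection: each reference pixel is unprojected with its depth, moved by the relative pose, projected into the target view, and any target pixel that receives no reference pixel is marked as out of view. The function name `out_of_view_mask`, the pinhole intrinsics tuple `(fx, fy, cx, cy)`, the shared intrinsics for both views, and nearest-pixel rounding are illustrative choices, not details taken from the paper.

```python
def out_of_view_mask(depth, K, T):
    """Return O with O[v][u] = 1 where no reference pixel lands in the target view.

    depth: H x W nested list of positive depths at the reference view (D).
    K: (fx, fy, cx, cy) pinhole intrinsics, assumed shared by both views.
    T: 3 x 4 relative pose [R | t], reference-camera to target-camera coordinates.
    """
    fx, fy, cx, cy = K
    height, width = len(depth), len(depth[0])
    covered = [[0] * width for _ in range(height)]
    for v in range(height):
        for u in range(width):
            d = depth[v][u]
            # Unproject pixel (u, v) to a 3D point in the reference camera frame.
            p = ((u - cx) / fx * d, (v - cy) / fy * d, d)
            # Apply the relative camera pose T = [R | t].
            q = [sum(T[r][c] * p[c] for c in range(3)) + T[r][3] for r in range(3)]
            if q[2] <= 0:
                continue  # point falls behind the target camera
            # Project into the target image and round to the nearest pixel.
            u_t = round(fx * q[0] / q[2] + cx)
            v_t = round(fy * q[1] / q[2] + cy)
            if 0 <= u_t < width and 0 <= v_t < height:
                covered[v_t][u_t] = 1
    # Out-of-view pixels are the target pixels that nothing reprojected onto.
    return [[1 - c for c in row] for row in covered]
```

With a constant depth of 2.0, `K = (2.0, 2.0, 0.0, 0.0)` and a pose that translates points by one unit along x, every pixel shifts one column to the right, so only the first target column is flagged as out of view. A practical implementation would also need to handle occlusion and sub-pixel splatting, which this sketch ignores.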