Bridging Implicit and Explicit Geometric Transformation for Single-Image View Synthesis
Byeongjun Park, Hyojun Go, Changick Kim
TL;DR
This work tackles the seesaw problem in single-image view synthesis, where preserving reprojected seen content and realistically filling unseen regions are at odds. It introduces a non-autoregressive framework that bridges explicit and implicit geometric transformations via two parallel renderers and a depth-informed GLSA encoder, unified by a transformation similarity loss. The method achieves state-of-the-art performance on RealEstate10K and ACID, with PSNR-vis improvements and the lowest FID, while delivering roughly $100\times$ faster inference than autoregressive rivals. By combining depth-driven 3D geometry, global-local point-cloud attention, and a balanced renderer collaboration, the approach enables real-time, high-fidelity novel-view synthesis and shows promise for extrapolation tasks beyond single-view setups.
Abstract
Creating novel views from a single image has made tremendous strides with advanced autoregressive models, as unseen regions have to be inferred from the visible scene contents. Although recent methods generate high-quality novel views, synthesizing with only one explicit or implicit 3D geometry involves a trade-off between two objectives that we call the "seesaw" problem: 1) preserving reprojected contents and 2) completing realistic out-of-view regions. Autoregressive models also incur a considerable computational cost. In this paper, we propose a single-image view synthesis framework that mitigates the seesaw problem while using an efficient non-autoregressive model. Motivated by the observation that explicit methods preserve reprojected pixels well and implicit methods complete realistic out-of-view regions, we introduce a loss function that makes the two renderers complement each other. Our loss function encourages explicit features to improve the reprojected area of implicit features, and implicit features to improve the out-of-view area of explicit features. With the proposed architecture and loss function, we alleviate the seesaw problem, outperforming autoregressive state-of-the-art methods while generating an image $\approx$100 times faster. We validate the efficiency and effectiveness of our method with experiments on the RealEstate10K and ACID datasets.
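The complementary role of the two renderers can be illustrated with a minimal sketch. It is not the paper's actual loss: the function names (`masked_l1`, `complementary_loss`), the use of an L1 distance, and the binary visibility mask `visible` are all assumptions made for illustration. The sketch only shows the idea of comparing explicit and implicit feature maps separately over the reprojected area and the out-of-view area. In an autograd framework, the "teacher" side of each term (explicit features inside the reprojected area, implicit features outside it) would presumably be detached so that gradients flow only to the other renderer; plain NumPy cannot express that, so it is noted in comments only.

```python
import numpy as np


def masked_l1(a, b, mask):
    """Mean absolute difference between a and b over entries where mask == 1."""
    mask = np.broadcast_to(mask, a.shape)
    denom = mask.sum()
    if denom == 0:
        return 0.0
    return float((np.abs(a - b) * mask).sum() / denom)


def complementary_loss(feat_explicit, feat_implicit, visible):
    """Hypothetical two-term loss over (C, H, W) feature maps.

    visible: (H, W) array, 1 where source pixels reproject into the
    target view, 0 for out-of-view pixels (assumed to be given).
    """
    visible = visible.astype(feat_explicit.dtype)[None]  # shape (1, H, W)
    # Reprojected area: explicit features act as the teacher
    # (they would be detached in an autograd framework).
    loss_in = masked_l1(feat_implicit, feat_explicit, visible)
    # Out-of-view area: implicit features act as the teacher.
    loss_out = masked_l1(feat_explicit, feat_implicit, 1.0 - visible)
    return loss_in, loss_out


# Toy example: 2 channels on a 2x2 grid, only the top-left pixel is visible.
feat_explicit = np.zeros((2, 2, 2))
feat_implicit = np.full((2, 2, 2), 2.0)
feat_implicit[:, 0, 0] = 1.0
visible = np.array([[1, 0], [0, 0]])

loss_in, loss_out = complementary_loss(feat_explicit, feat_implicit, visible)
print(loss_in, loss_out)  # 1.0 2.0
```

Keeping the two terms separate makes it possible to weight the reprojected and out-of-view areas differently, which is one way a balance between the two objectives could be tuned.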
