PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis
Jason J. Yu, Tristan Aumentado-Armstrong, Fereshteh Forghani, Konstantinos G. Derpanis, Marcus A. Brubaker
TL;DR
PolyOculus introduces a permutation-invariant set-to-set diffusion model for generative novel view synthesis (GNVS) that defines $p_\theta(\{\mathbf{z}_1,\ldots,\mathbf{z}_N\} \mid \mathbf{c})$ and jointly generates a set of novel views conditioned on a set of observed views, mitigating the loop inconsistencies of autoregressive GNVS. The approach uses a shared U-Net with cross-attention across streams and camera-ray canonicalization with Fourier-encoded rays to inject geometry without explicit 3D priors. It adopts a semi-autoregressive, keyframe-based sampling strategy built on the factorization $p(\mathbf{z}_1,\ldots,\mathbf{z}_N)=\prod_{i=1}^G p(\{\mathbf{z}_j \mid j\in \mathcal{V}_i\} \mid \{\mathbf{z}_k \mid k\in \mathcal{C}_i\})$, where group $i$ generates the views indexed by $\mathcal{V}_i$ conditioned on the views indexed by $\mathcal{C}_i$; this enables the Ours-Markov, Ours-1step, and Ours-KF variants. The method demonstrates improved image quality (FID) and consistency (TSED) on RealEstate10K and Matterport3D compared to image-based GNVS baselines and rendering-based NeRF methods. The key contribution is enabling unordered view-sets, including looped and stereo-grouped trajectories, with reduced computational burden relative to NeRF-based pipelines, supporting scalable GNVS for real-world multi-view applications.
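To make the grouped factorization in the TL;DR concrete, the following is a minimal sketch of one way a keyframe schedule could be written down as a list of (generated set, conditioning set) pairs. It is an illustration under stated assumptions, not the authors' implementation: the function names `keyframe_schedule` and `is_valid_factorization`, the `stride` parameter, and the choice to condition in-between views on their two bounding keyframes are all hypothetical. The observed views $\mathbf{c}$ are treated as implicit context for every group, so an empty conditioning list means "observed views only".

```python
from typing import List, Tuple

# One factor of the product: (V_i, C_i) = (views to generate, generated views to condition on).
Group = Tuple[List[int], List[int]]


def keyframe_schedule(n_views: int, stride: int) -> List[Group]:
    """Hypothetical keyframe grouping (an assumption, not the paper's exact scheme).

    Group 1 jointly generates every `stride`-th view plus the last view.
    Each later group fills the gap between two adjacent keyframes,
    conditioned on those two keyframes.
    """
    if n_views < 1 or stride < 1:
        raise ValueError("n_views and stride must be positive")

    keyframes = list(range(0, n_views, stride))
    if keyframes[-1] != n_views - 1:
        keyframes.append(n_views - 1)  # always anchor the end of the trajectory

    # Empty conditioning list: keyframes depend only on the observed views c.
    groups: List[Group] = [(keyframes, [])]
    for left, right in zip(keyframes, keyframes[1:]):
        between = list(range(left + 1, right))
        if between:
            groups.append((between, [left, right]))
    return groups


def is_valid_factorization(groups: List[Group], n_views: int) -> bool:
    """Check that each view is generated exactly once and that every
    conditioning set refers only to views generated by earlier groups."""
    generated = set()
    for targets, context in groups:
        if not set(context) <= generated:
            return False  # conditions on a view that does not exist yet
        if generated & set(targets):
            return False  # a view would be generated twice
        generated |= set(targets)
    return generated == set(range(n_views))


if __name__ == "__main__":
    for targets, context in keyframe_schedule(7, 3):
        print(f"generate {targets} given {context}")
```

For seven views and a stride of three, this sketch produces one keyframe group over views 0, 3, and 6, followed by two fill-in groups. The validity check captures the two properties any such product of conditionals needs: the generated sets partition the views, and no group conditions on a view that has not yet been sampled.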
Abstract
This paper considers the problem of generative novel view synthesis (GNVS), generating novel, plausible views of a scene given a limited number of known views. Here, we propose a set-based generative model that can simultaneously generate multiple, self-consistent new views, conditioned on any number of views. Our approach is not limited to generating a single image at a time and can condition on a variable number of views. As a result, when generating a large number of views, our method is not restricted to a low-order autoregressive generation approach and is better able to maintain generated image quality over large sets of images. We evaluate our model on standard NVS datasets and show that it outperforms the state-of-the-art image-based GNVS baselines. Further, we show that the model is capable of generating sets of views that have no natural sequential ordering, like loops and binocular trajectories, and significantly outperforms other methods on such tasks.
