PolyOculus: Simultaneous Multi-view Image-based Novel View Synthesis

Jason J. Yu, Tristan Aumentado-Armstrong, Fereshteh Forghani, Konstantinos G. Derpanis, Marcus A. Brubaker

TL;DR

PolyOculus introduces a permutation-invariant set-to-set diffusion model for GNVS that defines $p_\theta(\{\mathbf{z}_1,\ldots,\mathbf{z}_N\} \mid \mathbf{c})$ and jointly generates a set of novel views conditioned on a set of observed views, mitigating loop inconsistencies of autoregressive GNVS. The approach uses a shared U-Net with cross-attention across streams and camera-ray canonicalization with Fourier-encoded rays to inject geometry without explicit 3D priors. It adopts a semi-autoregressive sampling strategy with keyframes via a factorization $p(\mathbf{z}_1,\ldots,\mathbf{z}_N)=\prod_{i=1}^G p(\{\mathbf{z}_j \mid j\in \mathcal{V}_i\} \mid \{\mathbf{z}_k \mid k\in \mathcal{C}_i\})$, enabling Ours-Markov, Ours-1step, and Ours-KF; and demonstrates improved image quality (FID) and consistency (TSED) on RealEstate10K and Matterport3D compared to image-based GNVS baselines and rendering-based NeRF methods. The key contribution is enabling unordered view-sets, including looped and stereo-grouped trajectories, with reduced computational burden relative to NeRF-based pipelines, supporting scalable GNVS for real-world multi-view applications.
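
The grouped factorization above can be made concrete with a small sketch. Each factor is represented as a pair (views generated jointly, views conditioned on), and the number of sampling steps separating each view from the observed view is tracked. The function names, the keyframe layout, and the depth definition (one plus the deepest conditioning view) are illustrative assumptions, not the authors' exact schedules.

```python
from typing import Dict, List, Tuple

# One generation step: (views generated jointly, views conditioned on).
Step = Tuple[List[int], List[int]]


def markov_schedule(n_views: int) -> List[Step]:
    """One view per step, conditioned only on the previous view (view 0 is observed)."""
    return [([i], [i - 1]) for i in range(1, n_views + 1)]


def keyframe_schedule(n_views: int, stride: int) -> List[Step]:
    """Hypothetical keyframed schedule: keyframes first, then the gaps between them."""
    keyframes = list(range(stride, n_views + 1, stride))
    steps: List[Step] = []
    if keyframes:
        steps.append((keyframes, [0]))  # all keyframes sampled jointly
    bounds = [0] + keyframes
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        gap = list(range(lo + 1, hi))
        if gap:
            steps.append((gap, [lo, hi]))  # in-between views, sampled jointly
    tail = list(range(bounds[-1] + 1, n_views + 1))
    if tail:
        steps.append((tail, [bounds[-1]]))  # views past the last keyframe
    return steps


def sampling_depth(steps: List[Step]) -> Dict[int, int]:
    """Assumed definition: depth = 1 + deepest conditioning view; observed view is 0."""
    depth = {0: 0}
    for generated, conditioning in steps:
        d = 1 + max(depth[c] for c in conditioning)
        for v in generated:
            depth[v] = d
    return depth


if __name__ == "__main__":
    n = 8
    print("Markov max depth:  ", max(sampling_depth(markov_schedule(n)).values()))
    print("Keyframe max depth:", max(sampling_depth(keyframe_schedule(n, 4)).values()))
```

Under these assumptions, the Markov schedule's maximum depth equals the number of generated views, while the keyframed schedule's stays at two for this layout, which is the intuition behind reduced error accumulation.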

Abstract

This paper considers the problem of generative novel view synthesis (GNVS), generating novel, plausible views of a scene given a limited number of known views. Here, we propose a set-based generative model that can simultaneously generate multiple, self-consistent new views, conditioned on any number of views. Our approach is not limited to generating a single image at a time and can condition on a variable number of views. As a result, when generating a large number of views, our method is not restricted to a low-order autoregressive generation approach and is better able to maintain generated image quality over large sets of images. We evaluate our model on standard NVS datasets and show that it outperforms the state-of-the-art image-based GNVS baselines. Further, we show that the model is capable of generating sets of views that have no natural sequential ordering, like loops and binocular trajectories, and significantly outperforms other methods on such tasks.

Paper Structure (17 sections, 9 equations, 15 figures, 3 tables, 3 algorithms)

This paper contains 17 sections, 9 equations, 15 figures, 3 tables, 3 algorithms.

Figures (15)

  • Figure 1: A "loop inconsistency", in which an image-based GNVS network fails to retain consistency with previous generations, and its resolution with a set-based approach. Left: An observed (real) image on which the synthesized views are conditioned, and the camera trajectory (a "spinning" motion pointed at a fixed target), which temporarily conceals parts of the scene. Top row: An autoregressive strategy 'forgets' its prior outputs and invents inconsistent content (e.g., compare the marked areas in the final column to the observation). Bottom row: Our method constructs self-consistent views by (i) more intelligent conditioning on the most relevant images and (ii) simultaneous generation of a multi-view set, allowing mutual constraints within the sampling process.
  • Figure 2: Overview of our set-based diffusion architecture for NVS. Given a conditioning image set, $X_c = \{x_1,\ldots,x_{n_c}\}$ (left inset, left-stream), we generate a set of novel views, $X_{g,t} = \{x_{n_c+1,t},\ldots,x_{n_c+n_g,t}\}$ (left inset, right-stream). Processing $X_c$ is identical to $X_{g,t}$, except time is fixed to $t\equiv 0$ (i.e., without noise). Simultaneous generation is performed by independently applying the U-Net across streams, to each $x\in X_c\cup X_{g,t}$, except at cross-attention (CA) layers (middle inset), which facilitate order-independent inter-stream dependencies. Each CA block combines (i) same-layer features ($f_{j,t}$) and (ii) camera information ($\widetilde{c}_j$, with rays $\widetilde{\mathcal{R}}_j$) across streams. For stream $i$, a camera canonicalization block (right inset) provides invariance to rigid transforms, via $\mathcal{R}_j = \mathrm{CC}_i(\widetilde{\mathcal{R}}_j)$, which treats $\widetilde{c}_i$ as a reference viewpoint. Then, with the reference rays as the queries, $Q_{i,t} = \mathcal{R}_{\mathrm{ref}}$, attention is applied across all streams $j$, with the keys as transformed rays, $\{\mathcal{R}_j\}_j$, and the values as the features, $\{f_{j,t}\}_j$.
  • Figure 2: RealEstate10K reconstruction errors for short-term and long-term view extrapolation at $128\times128$.
  • Figure 3: Illustration of different generation orders, and the sampling depth of each view from the observed image. The sampling depth of each view from the observed view(s) is highlighted in red for (a) standard Markov autoregressive, (b) keyframed, and (c) grouped (e.g., stereo camera views) sampling. Notice that the largest sampling depth under (b) and (c) grows more slowly with the total number of views than under (a). This reduces error accumulation in later views.
  • Figure 3: Reconstruction metrics on stereo pairs, where ground-truth reference frames are available ("right eye" view).
  • ...and 10 more figures
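
The ray-based cross-attention described in the Figure 2 caption (reference rays as queries, canonicalized rays as keys, features as values) can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions: a single attention head, no learned projections, octave-spaced Fourier frequencies, and a 4x4 camera-to-world pose convention. The function names are hypothetical and do not come from the authors' code.

```python
import numpy as np


def fourier_encode(x: np.ndarray, num_freqs: int = 4) -> np.ndarray:
    """Map each coordinate to sin/cos features at octave-spaced frequencies."""
    freqs = 2.0 ** np.arange(num_freqs)             # (F,)
    angles = x[..., None] * freqs                   # (..., D, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(*x.shape[:-1], -1)           # (..., D * 2F)


def canonicalize_rays(origins, dirs, ref_cam_to_world):
    """Express world-space rays in the reference camera's frame."""
    rot, trans = ref_cam_to_world[:3, :3], ref_cam_to_world[:3, 3]
    # Row-vector form of x_ref = rot^T (x_world - trans).
    return (origins - trans) @ rot, dirs @ rot


def ray_cross_attention(query_rays, key_rays, value_feats):
    """Single-head attention: queries/keys are ray encodings, values are features.

    query_rays: (P, E); key_rays: (S, P, E); value_feats: (S, P, C).
    """
    keys = key_rays.reshape(-1, key_rays.shape[-1])          # (S*P, E)
    values = value_feats.reshape(-1, value_feats.shape[-1])  # (S*P, C)
    logits = query_rays @ keys.T / np.sqrt(keys.shape[-1])
    logits -= logits.max(axis=-1, keepdims=True)             # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values                                  # (P, C)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S, P, C = 3, 16, 8  # streams, pixels per view, feature channels
    origins = rng.normal(size=(S, P, 3))
    dirs = rng.normal(size=(S, P, 3))
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)
    feats = rng.normal(size=(S, P, C))
    ref_pose = np.eye(4)
    ref_pose[:3, 3] = [0.5, 0.0, -1.0]
    o, d = canonicalize_rays(origins, dirs, ref_pose)
    enc = fourier_encode(np.concatenate([o, d], axis=-1))
    # Stream 0 stands in for the reference stream in this toy example.
    out = ray_cross_attention(enc[0], enc, feats)
    perm = [2, 0, 1]
    out_perm = ray_cross_attention(enc[0], enc[perm], feats[perm])
    print(out.shape, np.allclose(out, out_perm))
```

Because the softmax runs over the union of all streams' keys, reordering the streams leaves the output unchanged up to floating-point error, which illustrates the order-independence the caption attributes to the CA layers.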