One View Is Enough! Monocular Training for In-the-Wild Novel View Generation

Adrien Ramanana Rahary, Nicolas Dufour, Patrick Perez, David Picard

Abstract

Monocular novel-view synthesis has long required multi-view image pairs for supervision, limiting training data scale and diversity. We argue that this requirement is unnecessary: one view is enough. We present OVIE, a model trained entirely on unpaired internet images. We leverage a monocular depth estimator as a geometric scaffold at training time: we lift a source image into 3D, apply a sampled camera transformation, and reproject the result to obtain a pseudo-target view. To handle disocclusions, we introduce a masked training formulation that restricts geometric, perceptual, and textural losses to valid regions, enabling training on 30 million uncurated images. At inference, OVIE is geometry-free, requiring no depth estimator or 3D representation. Trained exclusively on in-the-wild images, OVIE outperforms prior methods in a zero-shot setting, while being 600x faster than the second-best baseline. Code and models are publicly available at https://github.com/AdrienRR/ovie.
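
The pseudo-target construction described in the abstract (lift to 3D, apply a camera transformation, reproject, then restrict the loss to valid pixels) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a pinhole camera with known intrinsics `K` shared by both views, takes the depth map as a given array in place of the monocular depth estimator, and uses nearest-pixel splatting with a z-buffer. The function names `warp_to_pseudo_target` and `masked_l1` are hypothetical.

```python
import numpy as np


def warp_to_pseudo_target(image, depth, K, T):
    """Forward-warp `image` (H, W, C) into a new camera (illustrative sketch).

    depth: (H, W) depth along the camera z-axis, assumed positive.
    K:     (3, 3) pinhole intrinsics, assumed identical for both views.
    T:     (4, 4) rigid transform from source-camera to target-camera coordinates.
    Returns (pseudo_target, valid_mask); the mask is False where no source
    pixel lands, i.e. in disoccluded or out-of-frame regions.
    """
    H, W = depth.shape
    C = image.shape[-1]
    v, u = np.mgrid[0:H, 0:W]
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)

    # Lift: X = depth * K^{-1} [u, v, 1]^T
    pts = (pix @ np.linalg.inv(K).T) * depth.reshape(-1, 1)
    # Move the points into the target camera frame.
    pts_t = pts @ T[:3, :3].T + T[:3, 3]

    # Keep only points in front of the target camera.
    front = pts_t[:, 2] > 1e-6
    pts_f = pts_t[front]
    colors = image.reshape(-1, C)[front]

    # Project and round to the nearest target pixel.
    proj = pts_f @ K.T
    ut = np.rint(proj[:, 0] / proj[:, 2]).astype(np.int64)
    vt = np.rint(proj[:, 1] / proj[:, 2]).astype(np.int64)
    inside = (ut >= 0) & (ut < W) & (vt >= 0) & (vt < H)
    flat = vt[inside] * W + ut[inside]
    z = pts_f[inside, 2]
    colors = colors[inside]

    # Z-buffer: sort by pixel index, then by depth, and keep the nearest point.
    order = np.lexsort((z, flat))
    flat_sorted = flat[order]
    first = np.ones(flat_sorted.shape[0], dtype=bool)
    first[1:] = flat_sorted[1:] != flat_sorted[:-1]
    keep = order[first]

    target = np.zeros((H * W, C), dtype=image.dtype)
    mask = np.zeros(H * W, dtype=bool)
    target[flat[keep]] = colors[keep]
    mask[flat[keep]] = True
    return target.reshape(H, W, C), mask.reshape(H, W)


def masked_l1(pred, target, mask):
    """Mean absolute error over valid pixels only (a stand-in for a masked loss)."""
    m = mask[..., None].astype(np.float64)
    denom = max(m.sum() * pred.shape[-1], 1.0)
    return float((np.abs(pred - target) * m).sum() / denom)
```

With an identity transform every pixel maps to itself and the mask is fully valid. A sideways translation shifts pixels and leaves an unfilled band at one border, which the mask excludes from the loss.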

Paper Structure (43 sections, 6 equations, 16 figures, 5 tables)

Figures (16)

  • Figure 1: OVIE generates novel views from a single image across diverse domains given a source image (gray) and target poses (colored), regardless of content or style.
  • Figure 2: Method overview. Top: From web-sourced images $I_0$, a frozen monocular depth estimator extracts per-image 3D point clouds $\mathcal{P}$. We then sample camera transformations $T_{0 \rightarrow 1} \in SE(3)$ (rotation and translation), apply them to the point clouds, and reproject to generate pseudo-target views $I_1^*$. Bottom: Our model $f_\theta$ takes a source image $I_0$ and, conditioned on a camera transformation $T_{0 \rightarrow 1}$, predicts the corresponding novel view $\hat{I}_1$. Training combines a masked reconstruction loss $\mathcal{L}_{\text{recon}}$ and perceptual loss $\mathcal{L}_{\text{perc}}$ between $\hat{I}_1$ and $I_1^*$, and an adversarial loss $\mathcal{L}_{\text{adv}}$ where the discriminator $D_\phi$ distinguishes source images $I_0$ from predicted views $\hat{I}_1$.
  • Figure 3: Qualitative comparison with state-of-the-art methods. Given a source image and a target camera pose, each method synthesizes a novel view. Despite never being trained on multi-view data, OVIE produces sharp novel views with consistent geometry and accurately follows camera pose changes. Concurrent methods can fail to enforce the target pose entirely, or produce geometrically inconsistent results.
  • Figure 4: Metric scale understanding. The same 20 cm camera translation is applied to two scenes of different physical scales. The close-up banana (left, 50 cm away) undergoes a large apparent displacement, while the room-scale scene (right, 3 m away) shows a proportionally smaller shift consistent with metrically correct parallax.
  • Figure 5: Scaling with dataset size. PSNR and FID on RealEstate10K as a function of training set size. Both metrics improve consistently as data volume increases. SSIM and LPIPS curves, which follow the same trend, are reported in the Supplementary.
  • ...and 11 more figures
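
The parallax behaviour described in the Figure 4 caption follows from the pinhole model: for a camera translation parallel to the image plane, a point at depth Z shifts by roughly f · t / Z pixels. The short sketch below works through the caption's numbers (20 cm translation, scenes at 50 cm and 3 m). The focal length of 500 px and the assumption that the translation is purely lateral are illustrative choices, not values taken from the paper, and `parallax_px` is a hypothetical helper.

```python
def parallax_px(focal_px, translation_m, depth_m):
    """Approximate image-space shift (pixels) of a point at `depth_m`
    for a lateral camera translation of `translation_m` (pinhole model)."""
    return focal_px * translation_m / depth_m


FOCAL_PX = 500.0        # assumed focal length, for illustration only
TRANSLATION_M = 0.20    # 20 cm translation, as in the Figure 4 caption

near_shift = parallax_px(FOCAL_PX, TRANSLATION_M, 0.5)  # banana, 50 cm away
far_shift = parallax_px(FOCAL_PX, TRANSLATION_M, 3.0)   # room, 3 m away

print(f"near: {near_shift:.1f} px, far: {far_shift:.1f} px, "
      f"ratio: {near_shift / far_shift:.1f}")
```

Whatever focal length is assumed, the ratio of the two shifts equals the inverse ratio of the depths (3 m / 0.5 m = 6), which is why the close-up object moves much more than the room-scale scene.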