ERUPT: Efficient Rendering with Unposed Patch Transformer

Maxim V. Shugaev, Vincent Chen, Maxim Karrenbach, Kyle Ashley, Bridget Kennedy, Naresh P. Cuntoor

TL;DR

This work proposes ERUPT (Efficient Rendering with Unposed Patch Transformer), a state-of-the-art scene reconstruction model that renders scenes efficiently from unposed imagery. It introduces patch-based querying, in contrast to existing pixel-based queries, to reduce the compute required to render a target view.

Abstract

This work addresses the problem of novel view synthesis in diverse scenes from small collections of RGB images. We propose ERUPT (Efficient Rendering with Unposed Patch Transformer), a state-of-the-art scene reconstruction model capable of efficient scene rendering using unposed imagery. We introduce patch-based querying, in contrast to existing pixel-based queries, to reduce the compute required to render a target view. This makes our model highly efficient both during training and at inference, capable of rendering at 600 fps on commercial hardware. Notably, our model is designed to use a learned latent camera pose, which allows for training with unposed targets in datasets with sparse or inaccurate ground-truth camera poses. We show that our approach generalizes to large real-world data and introduce a new benchmark dataset (MSVS-1M) for latent view synthesis using street-view imagery collected from Mapillary. In contrast to NeRF and Gaussian Splatting, which require dense imagery and precise metadata, ERUPT can render novel views of arbitrary scenes with as few as five unposed input images. ERUPT achieves better rendered image quality than current state-of-the-art methods for unposed image synthesis tasks, reduces labeled data requirements by ~95%, and decreases computational requirements by an order of magnitude, providing efficient novel view synthesis for diverse real-world scenes.
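
To illustrate why patch-based querying reduces rendering compute, the sketch below counts the decoder queries needed for one target view under pixel-based and patch-based querying. The resolution (128x128) and patch size (8) are hypothetical values chosen for illustration, and the helper name num_queries is our own; none of them come from the paper.

```python
def num_queries(height: int, width: int, patch: int = 1) -> int:
    """Count the queries needed to render one target view.

    patch=1 corresponds to pixel-based querying (one query per pixel);
    patch>1 corresponds to patch-based querying (one query per
    patch x patch block of pixels).
    """
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


# Hypothetical settings, not taken from the paper.
H, W, P = 128, 128, 8

pixel_queries = num_queries(H, W)      # one query per pixel
patch_queries = num_queries(H, W, P)   # one query per 8x8 patch

print(pixel_queries, patch_queries, pixel_queries // patch_queries)
```

Under these assumed settings, the number of queries drops by a factor of P * P. Because the cost of cross-attention between queries and a fixed set of scene tokens grows linearly with the number of queries, the attention cost per rendered view would shrink by roughly the same factor.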

Paper Structure

This paper contains 24 sections, 5 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Schematic illustration of the ERUPT model. B refers to the backbone (feature extractor), while SA and CA refer to self- and cross-attention.
  • Figure 2: Qualitative results on MSN: input scene (5 images), target views (5 images), ERUPT B + LORA, ERUPT B + GAN, ERUPT B + Diffusion.
  • Figure 3: Qualitative results on MSVS-1M: input scene (5 images), target views (5 images), ERUPT B + LORA, ERUPT B + GAN, ERUPT B + Diffusion.
  • Figure 4: 360-degree rotation around a scene for cases where all items are sufficiently represented in the input images (top) and where parts are missing (bottom), which results in blurry, uncertain output in the $L_{2}$ setup.
  • Figure 5: Scene consistency: effect of seed on SD output.
  • ...and 2 more figures