SimVS: Simulating World Inconsistencies for Robust View Synthesis

Alex Trevithick, Roni Paiss, Philipp Henzler, Dor Verbin, Rundi Wu, Hadi Alzayer, Ruiqi Gao, Ben Poole, Jonathan T. Barron, Aleksander Holynski, Ravi Ramamoorthi, Pratul P. Srinivasan

TL;DR

SimVS tackles robust novel-view synthesis under casual capture by training a harmonization network on data augmented with world inconsistencies simulated by a video diffusion model. The method first generates inconsistent conditioning views from existing multiview data, then learns to produce a consistent set of views that enables accurate 3D reconstruction via existing NeRF/diffusion tools. It demonstrates superior performance over heuristic augmentations and purely synthetic data for both dynamic scenes and lighting changes, enabling high-fidelity static 3D reconstructions under challenging conditions. The approach is scalable to broader video-model pipelines and can be extended to other architectures or camera-control synthesis tasks.

Abstract

Novel-view synthesis techniques achieve impressive results for static scenes but struggle when faced with the inconsistencies inherent to casual capture settings: varying illumination, scene motion, and other unintended effects that are difficult to model explicitly. We present an approach for leveraging generative video models to simulate the inconsistencies in the world that can occur during capture. We use this process, along with existing multi-view datasets, to create synthetic data for training a multi-view harmonization network that is able to reconcile inconsistent observations into a consistent 3D scene. We demonstrate that our world-simulation strategy significantly outperforms traditional augmentation methods in handling real-world scene variations, thereby enabling highly accurate static 3D reconstructions in the presence of a variety of challenging inconsistencies. Project page: https://alextrevithick.github.io/simvs

Paper Structure

This paper contains 20 sections, 4 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: We show results of our model applied to a casual in-the-wild capture. (a) Given $8$ unordered images of a scene with significant motion and desired states marked in blue and orange, our model generates a 3D representation for each desired state shown in corresponding colors in (b) and (c). The CAT3D baseline in (d) cannot disentangle the different states, resulting in catastrophic failure.
  • Figure 2: A comparison of real world state changes, those simulated through a video model, and heuristic augmentations (random sparse flow fields for dynamics and random color tints for lighting).
  • Figure 3: Our method's overall pipeline. (a) Given a dataset of multiview images $x_i$, we simulate inconsistencies by (b) prompting a (c) video model and then (d) selecting inconsistent frames $\tilde{x}_i$. We feed these images along with a held-out reference image $x_0$ under the original condition to a (e) multiview generative model to predict (f) a set of corresponding consistent outputs $\hat{x}_i$. This output is supervised by the original multiview images $x_i$.
  • Figure 4: Samples from our multiview diffusion harmonization model, visualized for scene dynamics. Given the reference image and inconsistent input image, our model directly generates multiview images consistent with the state of the reference.
  • Figure 5: Samples from our multiview diffusion harmonization model, visualized for lighting. Given the reference image and inconsistent input image, our model directly generates multiview images consistent with the state of the reference.
  • ...and 3 more figures
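
The data flow in the Figure 3 caption can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `make_training_example`, `TrainingExample`, and `random_color_tint` are hypothetical, images are flattened lists of RGB tuples to keep the example dependency-free, and the video-model simulator is replaced by the random color tint heuristic mentioned in the Figure 2 caption (with an arbitrary gain range). The sketch only shows how a held-out reference $x_0$, simulated inconsistent inputs $\tilde{x}_i$, and original views $x_i$ used as supervision might be grouped into one training example.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Pixel = Tuple[float, ...]   # one RGB pixel, channel values in [0, 1]
Image = List[Pixel]         # flattened image (hypothetical toy representation)


def random_color_tint(image: Image, rng: random.Random) -> Image:
    """Heuristic lighting augmentation: scale each channel by a random gain.

    Stand-in for the video-model simulator (assumption). The gain range
    (0.6, 1.4) is an arbitrary illustrative choice, not from the paper.
    """
    gains = [rng.uniform(0.6, 1.4) for _ in range(3)]
    return [tuple(min(1.0, c * g) for c, g in zip(px, gains)) for px in image]


@dataclass
class TrainingExample:
    reference: Image                  # x_0: held out, original condition
    inconsistent_inputs: List[Image]  # x~_i: simulated inconsistent views
    targets: List[Image]              # x_i: original views, used as supervision


def make_training_example(
    views: Sequence[Image],
    simulate: Callable[[Image, random.Random], Image],
    rng: random.Random,
    reference_index: int = 0,
) -> TrainingExample:
    """Build one (reference, inconsistent inputs, targets) training example."""
    if len(views) < 2:
        raise ValueError("need a reference view plus at least one other view")
    reference = views[reference_index]
    # Every non-reference view is kept unchanged as the supervision target.
    targets = [v for i, v in enumerate(views) if i != reference_index]
    # Each target view is perturbed independently to form the model input.
    inconsistent = [simulate(v, rng) for v in targets]
    return TrainingExample(reference, inconsistent, targets)


if __name__ == "__main__":
    rng = random.Random(0)
    # Three toy "views", each a 4-pixel mid-gray image.
    views = [[(0.5, 0.5, 0.5)] * 4 for _ in range(3)]
    example = make_training_example(views, random_color_tint, rng)
    print(len(example.inconsistent_inputs), "inconsistent inputs,",
          len(example.targets), "targets")
```

Passing the simulator as a callable keeps the pairing logic independent of how the inconsistencies are produced, so a heuristic augmentation and a video-model-based simulator could be swapped without changing the rest of the sketch.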