SimVS: Simulating World Inconsistencies for Robust View Synthesis
Alex Trevithick, Roni Paiss, Philipp Henzler, Dor Verbin, Rundi Wu, Hadi Alzayer, Ruiqi Gao, Ben Poole, Jonathan T. Barron, Aleksander Holynski, Ravi Ramamoorthi, Pratul P. Srinivasan
TL;DR
SimVS addresses novel-view synthesis from casual captures by training a multi-view harmonization network on data in which a video diffusion model has simulated world inconsistencies such as scene motion and lighting changes. The method first generates inconsistent conditioning views from existing multi-view datasets, then trains the network to reconcile them into a consistent set of views that existing NeRF- and diffusion-based tools can reconstruct in 3D. It outperforms heuristic augmentations and purely synthetic training data on both dynamic scenes and lighting changes, enabling high-fidelity static 3D reconstruction under these conditions. The approach may also carry over to other video models, other architectures, and camera-controlled synthesis tasks.
Abstract
Novel-view synthesis techniques achieve impressive results for static scenes but struggle when faced with the inconsistencies inherent to casual capture settings: varying illumination, scene motion, and other unintended effects that are difficult to model explicitly. We present an approach for leveraging generative video models to simulate the inconsistencies in the world that can occur during capture. We use this process, along with existing multi-view datasets, to create synthetic data for training a multi-view harmonization network that is able to reconcile inconsistent observations into a consistent 3D scene. We demonstrate that our world-simulation strategy significantly outperforms traditional augmentation methods in handling real-world scene variations, thereby enabling highly accurate static 3D reconstructions in the presence of a variety of challenging inconsistencies. Project page: https://alextrevithick.github.io/simvs
