UniVerse: Unleashing the Scene Prior of Video Diffusion Models for Robust Radiance Field Reconstruction

Jin Cao, Hongrui Wu, Ziyong Feng, Hujun Bao, Xiaowei Zhou, Sida Peng

TL;DR

UniVerse tackles robust 3D reconstruction from inconsistent multi-view images by decoupling restoration from reconstruction. It converts image sets into an initial video via a camera-trajectory-based sampling strategy, then uses a conditional video diffusion model (VDM) with a Multi-input Query Transformer (MiQT) and semantic/mask conditioning to restore frames to a consistent state before performing 3D reconstruction. The approach leverages the large-scale scene prior learned by VDMs to handle diverse inconsistencies, supports style control through a reference style image, and remains effective even with sparse inputs. Empirical results on synthetic and real data show state-of-the-art robustness (PSNR/SSIM/LPIPS) and position UniVerse as a flexible preprocessor for downstream 3D tasks such as novel view synthesis and NeRF-based reconstruction on complex, real-world image collections.

Abstract

This paper tackles the challenge of robust reconstruction, i.e., the task of reconstructing a 3D scene from a set of inconsistent multi-view images. Some recent works have attempted to simultaneously remove image inconsistencies and perform reconstruction by integrating image degradation modeling into neural 3D scene representations. However, these methods rely heavily on dense observations for robustly optimizing model parameters. To address this issue, we propose to decouple robust reconstruction into two subtasks: restoration and reconstruction, which naturally simplifies the optimization process. To this end, we introduce UniVerse, a unified framework for robust reconstruction based on a video diffusion model. Specifically, UniVerse first converts inconsistent images into initial videos, then uses a specially designed video diffusion model to restore them into consistent images, and finally reconstructs the 3D scenes from these restored images. Compared with case-by-case per-view degradation modeling, the diffusion model learns a general scene prior from large-scale data, making it applicable to diverse image inconsistencies. Extensive experiments on both synthetic and real-world datasets demonstrate the strong generalization capability and superior performance of our method in robust reconstruction. Moreover, UniVerse can control the style of the reconstructed 3D scene. Project page: https://jin-cao-tma.github.io/UniVerse.github.io/

Paper Structure

This paper contains 28 sections, 15 equations, 13 figures, 5 tables, 2 algorithms.

Figures (13)

  • Figure 1: Given a set of multi-view images with inconsistencies such as photometric variation or transient occlusions, as shown in (a), existing robust reconstruction methods often fail to produce a high-quality 3D scene with minimal artifacts and floaters when the views are not dense enough, as illustrated in (b). In contrast, our method first uses a Video Diffusion Model to restore all images into a consistent state in (c), and then reconstructs the 3D scene from these restored images, resulting in the high-quality 3D scene in (d).
  • Figure 2: The flowchart of UniVerse. Given a set of inconsistent images, we first convert them into an initial video. We then use SAM (Kirillov et al., 2023) to identify transient occlusions and generate inpainting masks. These masks are used to set the occluded pixels in the initial video to zero. Next, we encode the video into latents using a VAE Encoder. After setting one image as the style image and assigning it a style mask, we concatenate the style masks, inpainting masks, latents, and randomly sampled Gaussian noise along the channel dimension and feed them into the U-Net. For each masked input image, we obtain semantic embeddings using the CLIP image encoder and aggregate them via the Multi-input Query Transformer to form a global semantic embedding. This embedding guides the U-Net in the video generation process. Finally, the U-Net output is decoded by the VAE Decoder to produce the restored video, from which we extract the consistent images and reconstruct a high-quality 3D scene. If there are too many images for the VDM to restore at once, we iteratively restore them in batches as described.
  • Figure 3: Visual results of novel view synthesis on synthetic datasets, with the corresponding depth map displayed in the bottom left corner.
  • Figure 4: Visual results of novel view synthesis on real datasets, with the corresponding depth map displayed in the bottom left corner.
  • Figure 5: Samples of our captured images.
  • ...and 8 more figures
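
The input assembly described in the Figure 2 caption (zeroing occluded pixels, then concatenating style masks, inpainting masks, latents, and Gaussian noise along the channel dimension) can be sketched roughly as follows. This is only an illustrative NumPy sketch, not the authors' implementation: the function names `zero_occluded` and `build_unet_input`, all tensor shapes, the random stand-in for the VAE latents, and the assumption that the masks have already been resized to the latent resolution are hypothetical.

```python
import numpy as np


def zero_occluded(frames, occlusion_masks):
    """Set pixels flagged as transient occlusions (mask == 1) to zero.

    frames: (T, 3, H, W); occlusion_masks: (T, 1, H, W), broadcast over RGB.
    """
    return frames * (1.0 - occlusion_masks)


def build_unet_input(latents, inpaint_masks, style_masks, rng):
    """Concatenate style masks, inpainting masks, latents, and noise on the channel axis.

    latents: (T, C, h, w); both masks: (T, 1, h, w), assumed to be already
    resized to the latent resolution (an assumption, not stated in the caption).
    """
    noise = rng.standard_normal(latents.shape).astype(latents.dtype)
    return np.concatenate([style_masks, inpaint_masks, latents, noise], axis=1)


rng = np.random.default_rng(0)
T, C, h, w = 4, 4, 8, 8  # hypothetical frame count and latent shape

# Pixel-space step: zero out the occluded region of each frame.
frames = rng.random((T, 3, 64, 64)).astype(np.float32)
occlusion = np.zeros((T, 1, 64, 64), dtype=np.float32)
occlusion[:, :, :16, :16] = 1.0
masked_frames = zero_occluded(frames, occlusion)

# Latent-space step: random values stand in for the VAE encoder output.
latents = rng.standard_normal((T, C, h, w)).astype(np.float32)
inpaint_masks = np.zeros((T, 1, h, w), dtype=np.float32)
inpaint_masks[:, :, :2, :2] = 1.0
style_masks = np.zeros((T, 1, h, w), dtype=np.float32)
style_masks[0] = 1.0  # frame 0 is chosen as the style image

unet_input = build_unet_input(latents, inpaint_masks, style_masks, rng)
print(unet_input.shape)  # (4, 10, 8, 8): 1 + 1 + C + C channels
```

With these shapes, the U-Net would receive 1 + 1 + C + C channels per frame. The channel layout of the actual model may differ from this sketch.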