BetterScene: 3D Scene Synthesis with Representation-Aligned Generative Model

Yuci Han, Charles Toth, John E. Anderson, William J. Shuart, Alper Yilmaz

TL;DR

BetterScene uses the production-ready Stable Video Diffusion (SVD) model, pretrained on billions of frames, as a backbone to mitigate artifacts and recover view-consistent details at inference time. A feed-forward 3D Gaussian Splatting model renders features that serve as input to the SVD enhancer, which generates continuous, artifact-free, consistent novel views.

Abstract

We present BetterScene, an approach to enhance novel view synthesis (NVS) quality for diverse real-world scenes from extremely sparse, unconstrained photos. BetterScene leverages the production-ready Stable Video Diffusion (SVD) model pretrained on billions of frames as a strong backbone, aiming to mitigate artifacts and recover view-consistent details at inference time. Prior methods have proposed similar diffusion-based solutions to these challenges. Despite significant improvements, they typically rely on off-the-shelf pretrained diffusion priors and fine-tune only the UNet module while keeping other components frozen, which still leads to inconsistent details and artifacts even when incorporating geometry-aware regularizations such as depth or semantic conditions. To address this, we investigate the latent space of the diffusion model and introduce two components: (1) temporal equivariance regularization and (2) vision foundation model-aligned representation, both applied to the variational autoencoder (VAE) module within the SVD pipeline. BetterScene integrates a feed-forward 3D Gaussian Splatting (3DGS) model to render features as inputs for the SVD enhancer and generate continuous, artifact-free, consistent novel views. We evaluate on the challenging DL3DV-10K dataset and demonstrate superior performance compared to state-of-the-art methods.
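
The abstract names two VAE-level components but does not give their formulas. As a rough illustration only, the sketch below shows one common way such terms are written: an equivariance penalty that compares decode(T(encode(x))) against T(x) for a transform T on the time axis, and an alignment penalty equal to one minus the mean cosine similarity between latent tokens and features from a frozen vision foundation model. The function names, the choice of frame reversal as the transform, and the toy linear encoder are all assumptions made for this example, not the authors' implementation.

```python
import numpy as np

def equivariance_loss(encode, decode, frames, transform):
    """Mean squared gap between decode(transform(encode(x))) and transform(x).

    `transform` acts on the leading (time) axis, so the same callable can be
    applied to pixel frames and to latents. Hypothetical formulation.
    """
    target = transform(frames)
    recon = decode(transform(encode(frames)))
    return float(np.mean((recon - target) ** 2))

def alignment_loss(latent_tokens, vfm_tokens, eps=1e-8):
    """One minus the mean cosine similarity between matching token pairs."""
    a = latent_tokens / (np.linalg.norm(latent_tokens, axis=-1, keepdims=True) + eps)
    b = vfm_tokens / (np.linalg.norm(vfm_tokens, axis=-1, keepdims=True) + eps)
    return float(1.0 - np.mean(np.sum(a * b, axis=-1)))

# Toy setup: 6 "frames" of 16 values each and a per-frame orthonormal
# linear autoencoder (a stand-in, not a real VAE).
rng = np.random.default_rng(0)
frames = rng.normal(size=(6, 16))
basis, _ = np.linalg.qr(rng.normal(size=(16, 16)))

def encode(x):
    return x @ basis

def decode(z):
    return z @ basis.T

def reverse_time(x):
    # Assumed temporal transform: play the clip backwards.
    return x[::-1]

eq = equivariance_loss(encode, decode, frames, reverse_time)

# Stand-in for frozen foundation-model features (random here).
features = rng.normal(size=(6, 8))
al_same = alignment_loss(features, features)
al_opposite = alignment_loss(features, -features)
```

In this toy case the autoencoder is per-frame and exactly invertible, so the equivariance penalty is numerically zero; a real video VAE mixes information across frames and would generally incur a nonzero penalty that training then reduces. The alignment penalty ranges from 0 (identical directions) to 2 (opposite directions).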

Paper Structure

This paper contains 11 sections, 7 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: We demonstrate our BetterScene approach on diverse in-the-wild scenes. Given sparse inputs, recent novel view synthesis methods suffer from performance degradation due to insufficient visual information. BetterScene enhances novel view rendering quality by mitigating artifacts and recovering view-consistent details at inference time with an alias-free, representation-aligned video diffusion model.
  • Figure 2: Visual quality and reconstruction FID (rFID) scores for autoencoders with different channel sizes. We trained all autoencoders on the DL3DV-10K dataset (Ling et al., 2024). Results show that the original 4-channel autoencoder design (Rombach et al., 2021), which is widely used in diffusion models, is unable to reconstruct fine details. Moreover, as shown in (b) and (c), increasing the channel size leads to much better reconstructions. We choose a 64-channel BetterScene autoencoder for our video diffusion model.
  • Figure 3: Overview of our BetterScene. The training process consists of two stages. In the first stage, we train an autoencoder using a representation-aligned and equivariance-regularized objective function. In the second stage, we freeze the pretrained BetterScene-VAE and fine-tune the denoiser U-Net within the SVD framework. We leverage a feed-forward 3DGS rendering module, MVSplat, to generate both coarse synthesized views and corresponding Gaussian feature latents. The SVD module then processes these coarse features to decode enhanced high-quality images.
  • Figure 4: A visual comparison of enhanced rendering results generated from 5 input views across scenes from the DL3DV benchmark test set. BetterScene demonstrates superior visual quality and enhanced detail consistency compared to existing state-of-the-art approaches.
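
The rFID reported in Figure 2 is a Fréchet distance between feature statistics of real images and their autoencoder reconstructions. The sketch below computes that distance from two arrays of precomputed features, assuming Gaussian statistics: the squared distance between means plus Tr(Σa + Σb − 2(ΣaΣb)^{1/2}). In practice the features come from a pretrained Inception network; here random vectors stand in for them, and the function name is hypothetical.

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """Fréchet distance between Gaussians fitted to two feature sets.

    Each input has shape (num_samples, feature_dim).
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) equals the sum of square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # positive semi-definite inputs; clip guards against round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    mean_term = np.sum((mu_a - mu_b) ** 2)
    return float(mean_term + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

# Stand-in features: "reconstructions" are the originals shifted by a
# constant vector, so only the mean term should contribute.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 4))
shift = np.array([3.0, 0.0, 0.0, 0.0])

d_same = frechet_distance(real, real)
d_shift = frechet_distance(real, real + shift)
```

Comparing a set with itself gives a distance near zero, and a pure shift of the features gives approximately the squared norm of the shift (9 here), since the covariances are unchanged. Lower rFID therefore means the reconstructions' feature distribution sits closer to that of the real images.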