SViM3D: Stable Video Material Diffusion for Single Image 3D Generation

Andreas Engelhardt, Mark Boss, Vikram Voleti, Chun-Han Yao, Hendrik P. A. Lensch, Varun Jampani

TL;DR

SViM3D tackles single-image inverse rendering by extending latent video diffusion to jointly produce multi-view RGB, spatially varying PBR parameters, and surface normals under camera control. It combines a material-encoded latent representation, an adapted UNet architecture, and a multi-illumination training regime, augmented with view-dependent masking and learnable homographies to enhance 3D reconstruction fidelity. A fast, differentiable environment-based lighting pipeline enables high-frequency relighting and 3D rendering, while a NeRF/DMTet-based pipeline lifts the outputs into textured 3D assets. The approach achieves state-of-the-art performance in novel view synthesis, relighting, and 3D reconstruction on object-centric datasets, and provides a robust neural prior for downstream AR/VR, film, and game applications.

Abstract

We present Stable Video Materials 3D (SViM3D), a framework to predict multi-view consistent physically based rendering (PBR) materials, given a single image. Recently, video diffusion models have been successfully used to reconstruct 3D objects from a single image efficiently. However, reflectance is still represented by simple material models or needs to be estimated in additional steps to enable relighting and controlled appearance edits. We extend a latent video diffusion model to output spatially varying PBR parameters and surface normals jointly with each generated view based on explicit camera control. This unique setup allows for relighting and generating a 3D asset using our model as a neural prior. We introduce various mechanisms to this pipeline that improve quality in this ill-posed setting. We show state-of-the-art relighting and novel view synthesis performance on multiple object-centric datasets. Our method generalizes to diverse inputs, enabling the generation of relightable 3D assets useful in AR/VR, movies, games, and other visual media.
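To make concrete why per-view PBR maps enable relighting, here is a minimal sketch of shading basecolor, metallic, and normal maps under a single directional light. It is not the paper's renderer: the paper describes environment-based lighting, whereas this sketch assumes one directional light and a diffuse-only (Lambertian) term, and it leaves roughness unused. The function names `split_metallic_workflow` and `shade_directional`, the array layouts, and the 0.04 dielectric reflectance constant (a common convention in metallic-roughness material models) are assumptions made for illustration.

```python
import numpy as np


def split_metallic_workflow(basecolor, metallic, dielectric_f0=0.04):
    """Split basecolor into diffuse albedo and specular F0 (metallic workflow).

    basecolor: (H, W, 3) array in [0, 1]
    metallic:  (H, W) array in [0, 1]
    dielectric_f0 is an assumed constant, not a value taken from the paper.
    """
    m = metallic[..., None]
    diffuse_albedo = basecolor * (1.0 - m)            # metals have no diffuse lobe
    f0 = dielectric_f0 * (1.0 - m) + basecolor * m    # metals tint their reflection
    return diffuse_albedo, f0


def shade_directional(basecolor, metallic, normals, light_dir,
                      light_rgb=(1.0, 1.0, 1.0)):
    """Diffuse-only relighting of per-pixel maps under one directional light.

    normals:   (H, W, 3) unit vectors (e.g. camera space)
    light_dir: (3,) vector pointing from the surface toward the light
    Returns an (H, W, 3) array of outgoing radiance.
    """
    l = np.asarray(light_dir, dtype=np.float64)
    l = l / np.linalg.norm(l)
    diffuse_albedo, _ = split_metallic_workflow(basecolor, metallic)
    # Lambert cosine term, clamped so back-facing pixels receive no light.
    n_dot_l = np.clip(normals @ l, 0.0, None)
    return diffuse_albedo / np.pi * n_dot_l[..., None] * np.asarray(light_rgb)
```

Calling `shade_directional` with the same maps and a different `light_dir` yields a relit image without re-running any generative model, which is the property the abstract relies on. A full PBR shader would add a specular lobe driven by roughness and `f0` and would integrate over an environment map rather than a single direction.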

Paper Structure

This paper contains 25 sections, 3 equations, 23 figures, 9 tables.

Figures (23)

  • Figure 1: SViM3D Improvements on Common Issues. We introduce several contributions that drastically improve reconstruction quality.
  • Figure 2: The SViM3D pipeline. We train a video diffusion model on multi-view and multi-illumination data to generate multi-view images with material parameters. During inference, given a single image, SViM3D can generate 21 views with consistent RGB radiance, albedo, roughness, metallic, and camera-space normals. We then use the synthesized novel views for 3D reconstruction that yields textured meshes with PBR materials. Starting from illumination pre-optimization, we further propose several techniques to aid the 3D reconstruction pipeline in this sparse-view setting, such as visibility masking, homography correction, and fast differentiable rendering.
  • Figure 3: Multi-view consistency. We compare the generated materials from different neural diffusion priors in a multi-view setting. SV3D [voletiSV3DNovelMultiview2024] shows multi-view consistent RGB output similar to SViM3D, which also generates multi-view consistent basecolor. Generating albedo maps on top of the SV3D views using RGB$\leftrightarrow$X [zengRGBX2024], StableMaterial (SM) of MaterialFusion [litmanMaterialFusionEnhancingInverse2024], or Intrinsic Image Diffusion (IID) [kocsisIntrinsicImageDiffusion2023] yields inconsistent results compared to the GT.
  • Figure 4: Multi-view PBR materials. Given the input image, SViM3D generates multi-view consistent novel views with corresponding basecolor, roughness, metallic, and normal maps. These can be used directly to generate views under novel illumination. We show 5 samples from a generated orbit and two new illumination settings as examples. The objects are sourced from our Poly Haven [polyhavenPolyHaven] test dataset. Please find additional results in the supplementary material.
  • Figure 5: Single image PBR materials. We compare the generated materials from different neural diffusion priors for a single image from the Poly Haven [polyhavenPolyHaven] test set. Besides the GT rendering and SViM3D (ours), results from RGB$\leftrightarrow$X [zengRGBX2024], StableMaterial (SM) of MaterialFusion [litmanMaterialFusionEnhancingInverse2024], and Intrinsic Image Diffusion (IID) [kocsisIntrinsicImageDiffusion2023] are presented. Note that IID uses monocular normals that are generated separately, and SM does not provide any normals.
  • ...and 18 more figures