Intrinsic Image Diffusion for Indoor Single-view Material Estimation

Peter Kocsis, Vincent Sitzmann, Matthias Nießner

TL;DR

This work presents Intrinsic Image Diffusion, a generative model for appearance decomposition of indoor scenes that produces significantly sharper, more consistent, and more detailed materials, outperforming state-of-the-art methods by 1.5 dB in PSNR and achieving a 45% better FID score on albedo prediction.

Abstract

We present Intrinsic Image Diffusion, a generative model for appearance decomposition of indoor scenes. Given a single input view, we sample multiple possible material explanations represented as albedo, roughness, and metallic maps. Appearance decomposition poses a considerable challenge in computer vision due to the inherent ambiguity between lighting and material properties and the lack of real datasets. To address this issue, we advocate for a probabilistic formulation: instead of attempting to directly predict the true material properties, we employ a conditional generative model to sample from the solution space. Furthermore, we show that the strong learned prior of recent diffusion models trained on large-scale real-world images can be adapted to material estimation and substantially improves generalization to real images. Our method produces significantly sharper, more consistent, and more detailed materials, outperforming state-of-the-art methods by $1.5\,\mathrm{dB}$ in PSNR and achieving a $45\%$ better FID score on albedo prediction. We demonstrate the effectiveness of our approach through experiments on both synthetic and real-world datasets.
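
To make the probabilistic formulation concrete, the sketch below shows one way to summarize several sampled material explanations for a single view: a per-pixel mean, a per-pixel variance map as an uncertainty proxy, and a PSNR helper of the kind used for albedo evaluation. It is only an illustration under stated assumptions: `fake_sampler` is a hypothetical stand-in that perturbs the input with Gaussian noise, not the authors' diffusion model, and all function names are invented for this example.

```python
import numpy as np


def fake_sampler(image, num_samples, rng):
    """Hypothetical stand-in for a conditional generative model.

    Assumption: it simply perturbs the input with Gaussian noise so the
    rest of the example has something to aggregate.
    """
    noise = rng.normal(scale=0.05, size=(num_samples,) + image.shape)
    return np.clip(image[None] + noise, 0.0, 1.0)


def summarize_samples(samples):
    """Aggregate N samples of shape (N, H, W, C).

    Returns the per-pixel mean (H, W, C) and a variance map (H, W),
    where the variance is averaged over the channel axis.
    """
    samples = np.asarray(samples, dtype=np.float64)
    mean = samples.mean(axis=0)
    variance = samples.var(axis=0).mean(axis=-1)
    return mean, variance


def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio in dB for images in [0, max_val]."""
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    mse = np.mean((pred - target) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                 # toy "input view"
samples = fake_sampler(image, num_samples=100, rng=rng)
mean_albedo, variance_map = summarize_samples(samples)
```

Pixels whose samples disagree get a high value in `variance_map`, which is the kind of signal one would expect for ambiguous regions; the noise level and sample count here are arbitrary choices for the toy example.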

Paper Structure (23 sections, 2 equations, 16 figures, 6 tables)

Figures (16)

  • Figure 1: Training pipeline. We train a conditional diffusion model to predict albedo and BRDF properties (roughness and metallic) given a single input image. We adapt the learned prior of Stable Diffusion [LDM] by fine-tuning it on the synthetic InteriorVerse dataset [ComplexInvIndoorMC]. Models being trained are marked in yellow. (i) First, we separately encode the ground-truth (GT) albedo and BRDF properties with a fixed encoder to obtain the material feature maps. We also encode the conditioning image with a trainable encoder. (ii) We add noise to the material features and use our conditional diffusion model to predict the noise. (iii) The training is supervised with an L2 loss between the original and predicted noise. (iv) Using the predicted noise, the predicted material properties can be decoded separately.
  • Figure 2: Synthetic evaluation. We qualitatively compare the predicted albedos of our method to the baselines [ComplexInvIndoor, ComplexInvIndoorMC] on the InteriorVerse dataset [ComplexInvIndoorMC]. Both baselines produce smoothed results, often with baked-in lighting, specularities, or shadows. In contrast, our method gives sharp and detailed predictions with consistent textures. See the supplementary material for more results with roughness and metallic predictions.
  • Figure 3: Sample diversity. We show multiple samples for a single scene and visualize the variance of the images across $100$ samples. Specular and emissive objects have higher variance since their material properties are highly ambiguous.
  • Figure 4: Lighting optimization. We optimize for $N_{light}$ point light sources with a spherical Gaussian (SG) emission profile together with global incident lighting. Our representation is expressive enough to capture detailed emissions yet controllable for relighting purposes.
  • Figure 5: Real-world evaluation. We qualitatively compare the predicted albedos of our method to the baselines [ComplexInvIndoor, ComplexInvIndoorMC] on the IIW [IIW] and ScanNet++ [scannet++] datasets. Lights and shadows pose a challenge for the previous methods, but our approach gives consistent results on real-world inputs as well. We also visualize the variance of our predictions, showing how the specular, emissive, and small objects have higher uncertainty. See the supplementary material for more results with roughness and metallic predictions.
  • ...and 11 more figures
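
Steps (ii) to (iv) of the Figure 1 caption follow the usual noise-prediction recipe for diffusion models, which can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the linear beta schedule, the number of steps, and every function name below are assumptions, and the denoiser itself is left out (a perfect noise prediction is used only to show how the pieces fit together).

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal fraction for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def add_noise(z0, t, alpha_bar, rng):
    """Step (ii): noise the clean material features z0 at timestep t."""
    eps = rng.standard_normal(z0.shape)
    a = alpha_bar[t]
    z_t = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    return z_t, eps


def noise_prediction_loss(eps_pred, eps):
    """Step (iii): L2 (mean squared error) loss between predicted and true noise."""
    return float(np.mean((eps_pred - eps) ** 2))


def recover_latent(z_t, eps_pred, t, alpha_bar):
    """Step (iv): estimate the clean features from the predicted noise.

    The result would then be passed to a decoder (not modeled here).
    """
    a = alpha_bar[t]
    return (z_t - np.sqrt(1.0 - a) * eps_pred) / np.sqrt(a)


rng = np.random.default_rng(0)
alpha_bar = make_alpha_bar()
z0 = rng.standard_normal((4, 8, 8))      # toy material feature map
z_t, eps = add_noise(z0, 500, alpha_bar, rng)

# Placeholder: a real model would predict eps from z_t, t, and the
# encoded conditioning image; here we plug in the true noise.
eps_pred = eps
loss = noise_prediction_loss(eps_pred, eps)
z0_hat = recover_latent(z_t, eps_pred, 500, alpha_bar)
```

With a perfect prediction the loss is zero and `z0_hat` matches `z0` up to floating-point error; during training the loss would instead be backpropagated through the denoiser and the trainable conditioning encoder.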