MatFusion: A Generative Diffusion Model for SVBRDF Capture

Sam Sartor, Pieter Peers

TL;DR

MatFusion reframes SVBRDF capture as a diffusion problem and introduces an unconditional 256M-parameter backbone trained on 312,165 synthetic SVBRDF exemplars to learn the distribution of spatially varying materials. It then softly conditions this backbone with zero-initialized input heads to create three lighting-conditioned refinements (colocated, natural, and flash/no-flash), enabling multiple plausible SVBRDF estimates from a single photograph by varying the seed. The approach avoids backpropagation through rendering during backbone training and supports diverse, controllable material reconstructions that better account for indirect lighting under natural illumination. Across synthetic and real-world tests, MatFusion achieves competitive or superior perceptual quality compared with prior direct inference and look-ahead methods, while offering sampling-based exploration of plausible materials for user selection. This diffusion-based framework provides a flexible, lighting-agnostic foundation for SVBRDF capture with practical implications for appearance modeling and material editing.
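The idea of "zero-initialized input heads" can be illustrated with a small NumPy sketch. This is not the authors' implementation: the per-pixel linear layer, the channel counts, and every name below are assumptions made for illustration. It only shows why zero-initializing the weights attached to the new photograph channels leaves the pretrained backbone's behavior unchanged at the start of refinement.

```python
import numpy as np

def first_layer(x, w):
    """Per-pixel linear layer (a 1x1 convolution): x is (H, W, C_in), w is (C_in, C_out)."""
    return x @ w

rng = np.random.default_rng(0)

# Hypothetical sizes (assumptions): noisy SVBRDF channels, photo channels, features.
C_SVBRDF, C_PHOTO, C_FEAT = 10, 3, 8

# Stand-in for the first-layer weights of a pretrained unconditional backbone.
w_backbone = rng.normal(size=(C_SVBRDF, C_FEAT))

# Widen the layer so it also accepts the photograph; the new rows start at zero.
w_conditional = np.concatenate([w_backbone, np.zeros((C_PHOTO, C_FEAT))], axis=0)

noisy_svbrdf = rng.normal(size=(4, 4, C_SVBRDF))
photo = rng.random(size=(4, 4, C_PHOTO))

out_backbone = first_layer(noisy_svbrdf, w_backbone)
out_conditional = first_layer(
    np.concatenate([noisy_svbrdf, photo], axis=-1), w_conditional
)

# Before any refinement, the photograph has no influence on the output.
same_at_start = bool(np.allclose(out_backbone, out_conditional))
```

Under these assumptions, the conditional model starts as an exact copy of the backbone, and the photograph's influence grows only as refinement moves the new weights away from zero.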

Abstract

We formulate SVBRDF estimation from photographs as a diffusion task. To model the distribution of spatially varying materials, we first train a novel unconditional SVBRDF diffusion backbone model on a large set of 312,165 synthetic spatially varying material exemplars. This SVBRDF diffusion backbone model, named MatFusion, can then serve as a basis for refining a conditional diffusion model to estimate the material properties from a photograph under controlled or uncontrolled lighting. Our backbone MatFusion model is trained using only a loss on the reflectance properties, and therefore refinement can be paired with more expensive rendering methods without the need for backpropagation during training. Because the conditional SVBRDF diffusion models are generative, we can synthesize multiple SVBRDF estimates from the same input photograph, from which the user can select the one that best matches the user's expectations. We demonstrate the flexibility of our method by refining different SVBRDF diffusion models conditioned on different types of incident lighting, and show that for a single photograph under colocated flash lighting our method achieves equal or better accuracy than existing SVBRDF estimation methods.

Paper Structure (21 sections, 4 equations, 9 figures, 2 tables)


Figures (9)

  • Figure 1: Summary of the MatFusion architecture.
  • Figure 2: For the first diffusion step, the denoising neural network $D_\theta$ fully relies on the input photograph (left) and acts as a direct inference network (middle). However, in contrast to direct inference, a diffusion model iteratively improves the estimate (right) by reducing burn-in, adding detail in the normal map, and improving diffuse-specular separation.
  • Figure 3: Global illumination transport within the spatially varying material is negligible for a colocated camera-light setup. However, under natural lighting, the effects are significant (i.e., self-shadowing and ambient occlusion).
  • Figure 4: Changing the seed results in different SVBRDF replicates conditioned on the input photograph. For each replicate, we show the generated SVBRDF property maps and a rendering under lighting that differs from that of the input photograph. Also marked are the SVBRDF selected automatically by render error with respect to the input lighting and the SVBRDF selected manually as the (subjectively) most plausible.
  • Figure 5: Qualitative comparison on real-world materials captured with a colocated light source, and relit from two different point light positions.
  • ...and 4 more figures
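
The automatic selection described in the Figure 4 caption (picking the replicate with the lowest render error under the input lighting) might look roughly like the sketch below. Everything here is an assumption for illustration: `toy_render` is a Lambertian placeholder rather than a real SVBRDF renderer, RMSE is just one possible error metric, and the candidates are hand-built arrays standing in for samples drawn with different seeds.

```python
import numpy as np

def toy_render(diffuse, normals, light_dir):
    """Toy Lambertian shading; a placeholder for a real SVBRDF renderer."""
    n_dot_l = np.clip(normals @ light_dir, 0.0, None)  # (H, W)
    return diffuse * n_dot_l[..., None]                # (H, W, 3)

def select_by_render_error(candidates, photo, light_dir):
    """Return the index of the candidate whose re-rendering best matches the photo (RMSE)."""
    errors = [
        float(np.sqrt(np.mean(
            (toy_render(c["diffuse"], c["normals"], light_dir) - photo) ** 2
        )))
        for c in candidates
    ]
    return int(np.argmin(errors)), errors

# Hypothetical replicates: flat normals, three different diffuse albedos.
light_dir = np.array([0.0, 0.0, 1.0])
flat_normals = np.zeros((4, 4, 3))
flat_normals[..., 2] = 1.0
candidates = [
    {"diffuse": np.full((4, 4, 3), albedo), "normals": flat_normals}
    for albedo in (0.2, 0.5, 0.8)
]

# Pretend the input photograph was produced by the middle candidate.
photo = toy_render(candidates[1]["diffuse"], flat_normals, light_dir)

best, errors = select_by_render_error(candidates, photo, light_dir)
```

In this toy setup the middle candidate reproduces the photograph exactly, so it is the one selected. As the caption notes, the lowest render error and the manually chosen, subjectively most plausible replicate are marked separately, so a ranking like this is best treated as a shortlist aid rather than a final answer.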