MatFusion: A Generative Diffusion Model for SVBRDF Capture
Sam Sartor, Pieter Peers
TL;DR
MatFusion reframes SVBRDF capture as a diffusion problem and introduces an unconditional 256M-parameter backbone trained on 312,165 synthetic SVBRDF exemplars to learn the distribution of spatially varying materials. It then softly conditions this backbone with zero-initialized input heads to create three lighting-conditioned refinements (colocated, natural, and flash/no-flash), enabling multiple plausible SVBRDF estimates from a single photograph by varying the seed. The approach avoids backpropagation through rendering during backbone training and supports diverse, controllable material reconstructions that better account for indirect lighting under natural illumination. Across synthetic and real-world tests, MatFusion achieves competitive or superior perceptual quality compared with prior direct inference and look-ahead methods, while offering sampling-based exploration of plausible materials for user selection. This diffusion-based framework provides a flexible, lighting-agnostic foundation for SVBRDF capture with practical implications for appearance modeling and material editing.
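To illustrate why a zero-initialized input head gives "soft" conditioning, the following minimal sketch uses a toy linear layer in place of the backbone. Everything here (`linear`, `make_backbone`, `conditioned_forward`, the vector sizes) is hypothetical and chosen only for illustration; it is not the MatFusion architecture. The point is that with all head weights at zero, the conditioned model reproduces the unconditional backbone exactly, so refinement starts from the pretrained behavior and only gradually learns to use the photograph.

```python
import random

def linear(weights, x):
    """Apply a bias-free dense layer; `weights` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def make_backbone(n_in, n_out, seed=0):
    """Hypothetical stand-in for one pretrained unconditional backbone layer."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]

def conditioned_forward(backbone, cond_head, noisy_maps, photo):
    """Project the photo through the head, add it to the input, run the backbone."""
    injected = linear(cond_head, photo)
    mixed = [a + b for a, b in zip(noisy_maps, injected)]
    return linear(backbone, mixed)

n_maps, n_photo = 4, 3  # assumed toy sizes
backbone = make_backbone(n_maps, n_maps)
head = [[0.0] * n_photo for _ in range(n_maps)]  # zero-initialized head

noisy_maps = [0.5, -1.0, 0.25, 2.0]  # toy noisy reflectance values
photo = [0.9, 0.1, 0.4]              # toy conditioning input

unconditional = linear(backbone, noisy_maps)
at_init = conditioned_forward(backbone, head, noisy_maps, photo)
same_at_init = unconditional == at_init  # the head contributes nothing yet

# Once training moves a head weight away from zero, the photo starts to matter.
head[0][0] = 1.0
after_update = conditioned_forward(backbone, head, noisy_maps, photo)
differs_after_update = unconditional != after_update
```

In a convolutional network the same idea is typically realized by zero-initializing the weights that read the extra conditioning channels; the additive head above is just the simplest equivalent for a linear toy.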
Abstract
We formulate SVBRDF estimation from photographs as a diffusion task. To model the distribution of spatially varying materials, we first train a novel unconditional SVBRDF diffusion backbone model on a large set of 312,165 synthetic spatially varying material exemplars. This SVBRDF diffusion backbone model, named MatFusion, can then serve as a basis for refining a conditional diffusion model to estimate the material properties from a photograph under controlled or uncontrolled lighting. Our backbone MatFusion model is trained using only a loss on the reflectance properties, and therefore refinement can be paired with more expensive rendering methods without the need for backpropagation during training. Because the conditional SVBRDF diffusion models are generative, we can synthesize multiple SVBRDF estimates from the same input photograph, from which the user can select the one that best matches the user's expectations. We demonstrate the flexibility of our method by refining different SVBRDF diffusion models conditioned on different types of incident lighting, and show that, for a single photograph under colocated flash lighting, our method achieves equal or better accuracy than existing SVBRDF estimation methods.
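The idea of drawing several candidate estimates from one photograph can be sketched as follows. This is a toy illustration only: `sample_estimate` is a hypothetical placeholder for a conditional diffusion sampler (it simply relaxes random noise toward the input while injecting decaying noise) and does not reproduce the paper's model, noise schedule, or SVBRDF representation. It shows the two properties that matter for the workflow: different seeds give different candidates, and a fixed seed reproduces the same candidate.

```python
import random

def sample_estimate(photo, seed, steps=20):
    """Toy stand-in for a conditional diffusion sampler (not the paper's model)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in photo]  # start from pure noise
    for t in range(steps, 0, -1):
        noise_scale = 0.1 * (t - 1) / steps   # no noise is added on the last step
        x = [
            xi + 0.3 * (p - xi) + noise_scale * rng.gauss(0.0, 1.0)
            for xi, p in zip(x, photo)
        ]
    return x

photo = [0.2, 0.7, 0.5]  # toy conditioning input
# One candidate per seed; a user would inspect these and keep the preferred one.
candidates = {seed: sample_estimate(photo, seed) for seed in range(4)}
```

Keeping the seed alongside each candidate means a selected estimate can be regenerated later without storing the full set of maps.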
