ControlMat: A Controlled Generative Approach to Material Capture

Giuseppe Vecchio, Rosalie Martin, Arthur Roullier, Adrien Kaiser, Romain Rouffet, Valentin Deschaintre, Tamy Boubekeur

TL;DR

ControlMat tackles material capture from a single photograph under unknown illumination by casting it as a controlled diffusion synthesis problem. It introduces MatGen, a latent diffusion model built on a VAE, and combines global (CLIP-based) and spatial (ControlNet) conditioning to generate 9 SVBRDF maps that are tileable and high-resolution. Noise rolling, multiscale diffusion, and patched decoding enable consistent patch-based generation, while border inpainting combined with rolling makes the results tileable. The authors report that the method outperforms prior inference and latent-space-optimization methods in both material estimation and generation. The approach makes 3D content creation more practical by enabling realistic material recovery and exploration from casually captured images, and it supports high-quality, relightable renderings at 4K resolution.

Abstract

Material reconstruction from a photograph is a key component of 3D content creation democratization. We propose to formulate this ill-posed problem as a controlled synthesis one, leveraging the recent progress in generative deep networks. We present ControlMat, a method which, given a single photograph with uncontrolled illumination as input, conditions a diffusion model to generate plausible, tileable, high-resolution physically-based digital materials. We carefully analyze the behavior of diffusion models for multi-channel outputs, adapt the sampling process to fuse multi-scale information and introduce rolled diffusion to enable both tileability and patched diffusion for high-resolution outputs. Our generative approach further permits exploration of a variety of materials which could correspond to the input image, mitigating the unknown lighting conditions. We show that our approach outperforms recent inference and latent-space-optimization methods, and carefully validate our diffusion process design choices. Supplemental materials and additional details are available at: https://gvecchio.com/controlmat/.

Paper Structure (39 sections, 2 equations, 14 figures, 2 tables, 1 algorithm)

This paper contains 39 sections, 2 equations, 14 figures, 2 tables, 1 algorithm.

Figures (14)

  • Figure 1: Overview of ControlMat. During training, the PBR maps are compressed into the latent representation $z$ using the encoder $\mathcal{E}$. Noise is then added to $z$ and the denoising is carried out by a U-Net model. The denoising process can be globally conditioned with the CLIP embedding of the prompt (text or image) and/or locally conditioned using the intermediate representation of a target photograph extracted by a ControlNet network. After $n$ denoising steps, the new denoised latent vector $\hat{z}$ is projected back to pixel space using the decoder $\mathcal{D}$. We enable high-resolution diffusion by splitting the input image into $\mathcal{N}$ patches, which are then diffused, decoded and reassembled through patched decoding.
  • Figure 2: Patch diffusion comparison. Examples of height map results using different approaches for patched latent diffusion.
  • Figure 3: Noise rolling. Visual representation of the noise rolling approach. The input is "rolled" over the x and y axes by a random translation, represented in the figure by replicating the image 2x2 and cropping the region contained in the blue square. Unrolling applies the inverse translation.
  • Figure 4: Tileable estimation. Visual representation of tileable estimation via border inpainting and noise rolling. We mask the border of the input image, letting the diffusion model entirely regenerate it (blue area in the figure) while estimating the properties of the unmasked area (red area in the figure). In combination with rolling, this ensures tileability while keeping the content of the image mostly unaltered.
  • Figure 5: Overview of our patched decoding. Decoding our latent vector $z$ per patch (reducing peak memory usage) introduces seams between patches. We propose to encourage similarity between patches by first decoding a low-resolution material and applying a mean matching operation between the corresponding regions. Combined with blending of overlapping patches, this prevents the appearance of seams.
  • ...and 9 more figures
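
The roll and unroll operations described in the Figure 3 caption can be sketched as a circular shift and its inverse. This is a minimal illustration and not the authors' implementation: the function names `roll2d`, `unroll2d`, and `random_roll` are hypothetical, a plain list of rows stands in for one channel of the latent, and the caption does not specify where in the sampling loop the roll is applied. In practice a tensor library call such as `numpy.roll` or `torch.roll` would do the same job.

```python
import random


def roll2d(grid, dy, dx):
    """Circularly shift a 2D grid (list of rows) by dy rows and dx columns.

    Content pushed past one edge re-enters from the opposite edge, which is
    equivalent to tiling the grid 2x2 and cropping a shifted window.
    """
    h, w = len(grid), len(grid[0])
    return [[grid[(y - dy) % h][(x - dx) % w] for x in range(w)]
            for y in range(h)]


def unroll2d(grid, dy, dx):
    """Undo roll2d by applying the opposite translation."""
    return roll2d(grid, -dy, -dx)


def random_roll(grid, rng):
    """Roll by a random translation and return the offsets needed to unroll."""
    h, w = len(grid), len(grid[0])
    dy, dx = rng.randrange(h), rng.randrange(w)
    return roll2d(grid, dy, dx), (dy, dx)


# Toy 3x4 "latent" whose entries encode their own (row, column) position.
latent = [[10 * y + x for x in range(4)] for y in range(3)]

rng = random.Random(0)  # fixed seed, only to make the sketch reproducible
rolled, (dy, dx) = random_roll(latent, rng)
restored = unroll2d(rolled, dy, dx)
```

Because the shift wraps around, the original left and right (and top and bottom) edges become neighbours inside the rolled grid, so any processing applied there sees the border as interior content; unrolling then puts every value back in its original position.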