FlashTex: Fast Relightable Mesh Texturing with LightControlNet

Kangle Deng, Timothy Omernick, Alexander Weiss, Deva Ramanan, Jun-Yan Zhu, Tinghui Zhou, Maneesh Agrawala

TL;DR

This work proposes a fast approach for automatically texturing an input 3D mesh from a user-provided text prompt. The approach disentangles lighting from surface material/reflectance in the resulting texture, so the mesh can be properly relit and rendered in any lighting environment.

Abstract

Manually creating textures for 3D meshes is time-consuming, even for expert visual content creators. We propose a fast approach for automatically texturing an input 3D mesh based on a user-provided text prompt. Importantly, our approach disentangles lighting from surface material/reflectance in the resulting texture so that the mesh can be properly relit and rendered in any lighting environment. We introduce LightControlNet, a new text-to-image model based on the ControlNet architecture, which allows the specification of the desired lighting as a conditioning image to the model. Our text-to-texture pipeline then constructs the texture in two stages. The first stage produces a sparse set of visually consistent reference views of the mesh using LightControlNet. The second stage applies a texture optimization based on Score Distillation Sampling (SDS) that works with LightControlNet to increase the texture quality while disentangling surface material from lighting. Our algorithm is significantly faster than previous text-to-texture methods, while producing high-quality and relightable textures.

Paper Structure

This paper contains 12 sections, 10 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: We propose an efficient approach for texturing an input 3D mesh given a user-provided text prompt. Our generated texture can be relit properly in different lighting environments. The light probe shows the varied lighting environment. We suggest that readers view our video results with rotating lighting at https://flashtex.github.io/.
  • Figure 2: Given a 3D mesh of a helmet (a) and a lighting environment $L$, the reference rendering (b) depicts the "correct" highlights on the mesh due to $L$, by treating its surface reflectance as half-metal and half-smooth with a gray diffuse color. (c) The texture generated by the leading method Fantasia3D (Chen et al., 2023) is not properly relit, as Fantasia3D bakes most of the lighting into the diffuse texture for the mesh and does not capture the bright highlights in the specular texture. (d) In contrast, our pipeline disentangles lighting from material, better capturing the diffuse and specular components of the metal helmet in this environment. Text prompt: "A medieval steel helmet."
  • Figure 3: Text-to-Texture pipeline. Our method efficiently synthesizes relightable textures given a 3D mesh and text prompt. In stage 1 (top left), we use multi-view visual prompting with our LightControlNet to generate four visually consistent canonical views of the mesh under fixed lighting, concatenated into a reference image $I_{\text{ref}}$. In stage 2, we apply a new texture optimization procedure using $I_{\text{ref}}$ as guidance along with a multi-resolution hash-grid representation of the texture $\Gamma(\beta(\cdot))$. For each iteration, we render two batches of images using $\Gamma(\beta(\cdot))$: one using the canonical views and lighting of $I_{\text{ref}}$ to compute a reconstruction loss $\mathcal{L}_{\text{recon}}$, and the other using randomly sampled views and lighting to compute an SDS loss $\mathcal{L}_{\text{SDS}}$ based on LightControlNet.
  • Figure 4: (a) LightControlNet requires a conditioning image that specifies desired lighting $L$ for a view $C$ of a 3D mesh. To form the conditioning image, we render the mesh with the desired $L$ and $C$ using three different materials: (1) non-metal, not smooth, (2) half-metal, half-smooth, and (3) pure metal, smooth, and then combine the renderings into a single three-channel image. (b) LightControlNet is a diffusion model conditioned on such light-conditioning images and text prompts.
  • Figure 5: Multi-view visual prompting. (a) When we independently input four canonical conditioning images to LightControlNet, it generates four very different appearances and styles even with a fixed random seed. (b) When we concatenate the four images into a 2$\times$2 grid and pass them as a single image into LightControlNet, it produces a far more consistent appearance and style. Text prompt: "A hiking boot".
  • ...and 7 more figures
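
The image bookkeeping described in the captions of Figures 4 and 5 can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes each of the three material renderings is a single-channel (H, W) array, that the channel order follows the caption (non-metal, half-metal, pure metal), and that the four canonical views are tiled in row-major order. The function names `make_light_conditioning`, `tile_2x2`, and `split_2x2` are hypothetical, and the renderer itself is replaced by random placeholder arrays.

```python
import numpy as np


def make_light_conditioning(r_nonmetal, r_half, r_metal):
    """Stack three single-channel renderings (H, W) into one (H, W, 3) image.

    Assumption: one material rendering per channel, in the order listed
    in the Figure 4 caption.
    """
    renders = [np.asarray(r, dtype=np.float32) for r in (r_nonmetal, r_half, r_metal)]
    if renders[0].ndim != 2 or len({r.shape for r in renders}) != 1:
        raise ValueError("expected three (H, W) renderings of identical shape")
    return np.stack(renders, axis=-1)


def tile_2x2(views):
    """Arrange four (H, W, C) images into one (2H, 2W, C) grid (row-major, assumed)."""
    if len(views) != 4:
        raise ValueError("expected exactly four views")
    top = np.concatenate(views[:2], axis=1)
    bottom = np.concatenate(views[2:], axis=1)
    return np.concatenate([top, bottom], axis=0)


def split_2x2(grid):
    """Inverse of tile_2x2: recover the four (H, W, C) tiles from the grid."""
    h, w = grid.shape[0] // 2, grid.shape[1] // 2
    return [grid[:h, :w], grid[:h, w:], grid[h:, :w], grid[h:, w:]]


if __name__ == "__main__":
    # Placeholder data standing in for renderer output (hypothetical).
    rng = np.random.default_rng(0)
    views = [
        make_light_conditioning(*(rng.random((64, 64)) for _ in range(3)))
        for _ in range(4)
    ]
    grid = tile_2x2(views)
    print(grid.shape)
```

In this sketch the tiled grid would play the role of the single conditioning image passed to the model, and `split_2x2` recovers per-view images from a generated grid of the same layout.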