
TexPainter: Generative Mesh Texturing with Multi-view Consistency

Hongkun Zhang, Zherong Pan, Congyi Zhang, Lifeng Zhu, Xifeng Gao

TL;DR

A method that enforces multi-view consistency through optimization-based color fusion, indirectly modifying the latent codes via gradient back-propagation. This also relaxes the sequential dependency assumption among the camera views.

Abstract

The recent success of pre-trained diffusion models unlocks the possibility of automatically generating textures for arbitrary 3D meshes in the wild. However, these models are trained in screen space, and converting their outputs into a multi-view consistent texture image poses a major obstacle to output quality. In this paper, we propose a novel method to enforce multi-view consistency. Our method is based on the observation that the latent space in a pre-trained diffusion model is noised separately for each camera view, making it difficult to achieve multi-view consistency by directly manipulating the latent codes. Based on the celebrated Denoising Diffusion Implicit Models (DDIM) scheme, we propose to use optimization-based color fusion to enforce consistency and to indirectly modify the latent codes by gradient back-propagation. Our method further relaxes the sequential dependency assumption among the camera views. By evaluating on a series of general 3D models, we find that our simple approach improves the consistency and overall quality of the generated textures compared to competing state-of-the-art methods. Our implementation is available at: https://github.com/Quantuman134/TexPainter

Paper Structure (18 sections, 8 equations, 9 figures, 2 tables, 1 algorithm)


Figures (9)

  • Figure 1: We run diffusion processes from two nearby views and enforce consistency by blending the noisy latent code into $I_z$ at each step. With a low-res $I_z$, the two views are correlated because they sample largely the same set of texels, which achieves multi-view consistency but yields low-quality, blurry images (middle). A high-res $I_z$ instead yields clear images, but the two views can fetch entirely different sets of texels due to the nearest sampling scheme used by SIMS, so consistency is not achieved (right).
  • Figure 2: Our modified multi-DDIM procedure that enforces multi-view consistency. Each view runs a separate denoising procedure using the DDIM scheme. At each denoising step, DDIM predicts a latent code $\hat{z}_{0,t}^i$ for the $i$th view at timestep $0$. These $\hat{z}_{0,t}^i$ are decoded to the color space, yielding $\hat{x}_{0,t}^i$. We then blend these views into a common color-space texture image by weighted averaging. Next, we perform an optimization that updates $\hat{z}_{0,t}^i$ into $\bar{z}_{0,t}^i$ for all views, such that their decoded images match the corresponding views rendered from the blended texture image. These updated latent codes are then plugged into DDIM to predict the next noise level.
  • Figure 3: We highlight the benefits of our joint optimization, Equation \ref{eq:FuseZFineTune} (left), as compared with Equation \ref{eq:FuseX} (right). The joint optimization achieves better texture quality in areas not well-sampled by camera views, but increases the inference cost from 25 min to 66 min.
  • Figure 4: A car model with its generated texture from a global prompt (top) and different prompts for different local regions (bottom).
  • Figure 5: Generated textures from different prompts.
  • ...and 4 more figures
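
The per-step data flow described in the Figure 2 caption can be illustrated with a minimal sketch. It is not taken from the authors' implementation: the function names (`predict_z0`, `ddim_step_with_z0`, `blend_texels`) are hypothetical, latents and texels are flattened to plain lists of floats, and the decoder and the gradient-based latent optimization are omitted entirely. The sketch assumes the standard deterministic DDIM relations (eta = 0) and assumes that the noise direction is re-derived from the updated clean-latent estimate, which is one plausible way to plug an updated latent back into DDIM.

```python
import math


def predict_z0(z_t, eps, alpha_bar_t):
    """Standard DDIM estimate of the clean latent from a noisy latent z_t
    and the predicted noise eps at cumulative noise level alpha_bar_t."""
    s = math.sqrt(alpha_bar_t)
    n = math.sqrt(1.0 - alpha_bar_t)
    return [(z - n * e) / s for z, e in zip(z_t, eps)]


def ddim_step_with_z0(z_t, z0_new, alpha_bar_t, alpha_bar_prev):
    """Deterministic DDIM step (eta = 0) using a replaced clean-latent estimate.

    Assumption of this sketch: the noise direction is recomputed from z_t
    and z0_new so the pair stays consistent. Requires alpha_bar_t < 1.
    """
    s_t, n_t = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    s_p, n_p = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    out = []
    for z, z0 in zip(z_t, z0_new):
        eps = (z - s_t * z0) / n_t         # noise implied by z_t and z0_new
        out.append(s_p * z0 + n_p * eps)   # latent at the previous (less noisy) level
    return out


def blend_texels(view_colors, view_weights):
    """Weighted average of per-view colors for each texel.

    view_colors[i][k] is the color view i assigns to texel k and
    view_weights[i][k] its non-negative weight (0 if the texel is not
    visible from view i). Texels seen by no view are returned as None.
    """
    num_texels = len(view_colors[0])
    blended = []
    for k in range(num_texels):
        total_w = sum(w[k] for w in view_weights)
        if total_w == 0.0:
            blended.append(None)
            continue
        acc = sum(c[k] * w[k] for c, w in zip(view_colors, view_weights))
        blended.append(acc / total_w)
    return blended
```

In a full pipeline, the output of `predict_z0` for each view would be decoded to color space, fused with something like `blend_texels`, and the latents would then be optimized so that their decoded images match renders of the fused texture before `ddim_step_with_z0` is called. How the blending weights are chosen is not specified in this section, so they are left as inputs here.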