TexGen: Text-Guided 3D Texture Generation with Multi-view Sampling and Resampling

Dong Huo, Zixin Guo, Xinxin Zuo, Zhihao Shi, Juwei Lu, Peng Dai, Songcen Xu, Li Cheng, Yee-Hong Yang

TL;DR

TexGen tackles text-driven 3D texture synthesis by using a pre-trained 2D diffusion model within a multi-view framework. It maintains a time-evolving UV texture map that is updated at each denoising step, and pairs it with attention-guided multi-view sampling and a Text&Texture-Guided Resampling strategy to keep views consistent while retaining rich detail. The authors report better texture quality and view consistency than state-of-the-art baselines across diverse meshes, in both qualitative and quantitative evaluations, and show that the method supports texture editing that preserves the original identity. The work notes remaining gaps relative to 2D texture quality and points to disentangling material and lighting effects as future work.

Abstract

Given a 3D mesh, we aim to synthesize 3D textures that correspond to arbitrary textual descriptions. Current methods for generating and assembling textures from sampled views often result in prominent seams or excessive smoothing. To tackle these issues, we present TexGen, a novel multi-view sampling and resampling framework for texture generation leveraging a pre-trained text-to-image diffusion model. For view-consistent sampling, we first maintain a texture map in RGB space that is parameterized by the denoising step and updated after each sampling step of the diffusion model to progressively reduce the view discrepancy. An attention-guided multi-view sampling strategy is exploited to broadcast the appearance information across views. To preserve texture details, we develop a noise resampling technique that aids in the estimation of noise, generating inputs for subsequent denoising steps, as directed by the text prompt and current texture map. Through extensive qualitative and quantitative evaluations, we demonstrate that our proposed method produces significantly better texture quality for diverse 3D objects with a high degree of view consistency and rich appearance details, outperforming current state-of-the-art methods. Furthermore, our proposed texture generation technique can also be applied to texture editing while preserving the original identity. More experimental results are available at https://dong-huo.github.io/TexGen/
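
As a rough illustration of the step-indexed texture map described in the abstract, the sketch below rebuilds a shared texture from several views within one denoising step. It is a toy under stated assumptions, not the authors' implementation: the function names `update_texture` and `build_step_texture`, the flat texel indexing, and the rule that earlier views take priority on overlapping texels are all hypothetical choices made for the example.

```python
import numpy as np

def update_texture(texture, filled, view_colors, view_texels):
    """Write one view's colors into the shared texture (hypothetical helper).

    texture:     (n_texels, 3) RGB texture for the current denoising step
    filled:      (n_texels,) bool, True where an earlier view already wrote
    view_colors: (k, 3) colors observed by this view
    view_texels: (k,) unique indices of the texels this view can see
    """
    texture = texture.copy()
    filled = filled.copy()
    # Assumption: texels already written by earlier views are left untouched.
    new = ~filled[view_texels]
    texture[view_texels[new]] = view_colors[new]
    filled[view_texels] = True
    return texture, filled

def build_step_texture(n_texels, views):
    """Accumulate partial textures over all views for one denoising step."""
    texture = np.zeros((n_texels, 3))
    filled = np.zeros(n_texels, dtype=bool)
    for view_colors, view_texels in views:
        texture, filled = update_texture(texture, filled, view_colors, view_texels)
    return texture, filled
```

In a full pipeline this accumulation would be repeated at every denoising step, so that the texture sharpens as the per-view observations do; the blending rule used here is only a placeholder for whatever the method actually applies.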

Paper Structure (17 sections, 12 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Given a 3D mesh, we present text-driven texture generation results from previous state-of-the-art approaches (TEXTure [richardson2023texture], Text2Tex [chen2023text2tex], Fantasia3D [chen2023fantasia3d], and ProlificDreamer [wang2023prolificdreamer]) as well as our proposed method.
  • Figure 2: Overview of our proposed method, where AGVS and T$^2$GR denote Attention-Guided View Sampling and Text&Texture-Guided Resampling, respectively. We first sample $N$ viewpoints across the object. As shown in (a), our texture sampling strategy interleaves texture generation with diffusion denoising. Specifically, the texture sampling process is structured into $T$ denoising steps of the diffusion process, and a complete RGB texture map ($\hat{U}_{t}^N$) is generated at the end of each step. As shown in (b), at denoising step $t$, each AGVS module receives noisy latent features $x_{t}^i$ as input to sample an image and produce a partial texture map $\hat{U}_{t}^i$, along with a noise estimate $\epsilon_\theta(x_t^i)$. The generated $\hat{U}_{t}^i$ serves as guidance for sampling the subsequent view. The complete texture map $\hat{U}_{t}^N$ is then used to refine the noise estimate of each view within the T$^2$GR modules, facilitating the prediction of the noisy features for the next denoising step ($x_{t-1}^{1...N}$).
  • Figure 3: Details of denoising for view $i+1$ at step $t$. The AGVS module generates the denoised observation $\hat{x}_0^{i+1}(x_t^{i+1})$, which is assembled onto UV space to form the intermediate texture $\hat{U}_{t}^{i+1}$. The attention guidance is omitted from the figure for simplicity. After iterating over all sampled views from $i=1$ to $N$, we obtain a complete texture map $\hat{U}_{t}^N$ for each denoising step. Conditioned on the current aggregated texture map, the T$^2$GR module updates the noise estimate $\epsilon_\theta(x_t^i)$ with multi-conditioned classifier-free guidance (CFG) to calculate the noisy latent feature $x_{t-1}^{i+1}$ for the next denoising step.
  • Figure 4: (a) Denoised observation $\hat{x}_0(x_t^i)$ at different denoising steps. The high-frequency information is gradually generated during sampling. (b) We claim that the over-smoothness from directly using Eq. \ref{eqn:correct} for noise sampling is caused by repeatedly passing through the VAE decoder and encoder at each denoising step. To validate this, we conducted an ablation on a simplified case that generates only a single viewpoint. It shows that the over-smoothness persists even for single-view generation. Mathematically, without the encoding and decoding operations at each denoising step, single-view sampling is exactly the same as DDIM sampling.
  • Figure 5: Visual comparison of our proposed method against TEXTure [richardson2023texture] and Text2Tex [chen2023text2tex].
  • ...and 4 more figures
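
The captions for Figures 3 and 4 refer to a denoised observation $\hat{x}_0(x_t)$, a noise estimate $\epsilon_\theta(x_t)$, and DDIM sampling. As a point of reference, the sketch below writes out the standard deterministic DDIM relations (the $\eta = 0$ case) that connect these quantities. The paper's own equations are not reproduced on this page, so the function names and the cumulative noise-schedule arguments `alpha_bar_t` and `alpha_bar_prev` are assumptions made for illustration. In TexGen the noise estimate would first be refined by the T$^2$GR module, which this sketch does not model.

```python
import numpy as np

def denoised_observation(x_t, eps, alpha_bar_t):
    """Estimate the clean sample x0 from a noisy sample and a noise estimate.

    Standard relation: x_t = sqrt(a) * x0 + sqrt(1 - a) * eps, solved for x0,
    where a is the cumulative noise-schedule product at step t.
    """
    return (x_t - np.sqrt(1.0 - alpha_bar_t) * eps) / np.sqrt(alpha_bar_t)

def ddim_step(x_t, eps, alpha_bar_t, alpha_bar_prev):
    """One deterministic DDIM update (eta = 0) from step t to step t-1."""
    x0_hat = denoised_observation(x_t, eps, alpha_bar_t)
    # Re-noise the estimate of x0 to the previous (lower) noise level,
    # reusing the same noise estimate.
    return np.sqrt(alpha_bar_prev) * x0_hat + np.sqrt(1.0 - alpha_bar_prev) * eps
```

When the noise estimate is exact, `denoised_observation` recovers the clean sample and `ddim_step` lands on the same sample at the lower noise level. That makes it easy to see where an extra decode and encode pass at every step, as discussed in the Figure 4 caption, could alter the trajectory.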