GenesisTex2: Stable, Consistent and High-Quality Text-to-Texture Generation

Jiawei Lu, Yingpeng Zhang, Zengjun Zhao, He Wang, Kun Zhou, Tianjia Shao

TL;DR

A novel text-to-texture synthesis framework that builds on pre-trained diffusion models and introduces a local attention reweighing mechanism in the self-attention layers. The mechanism guides the model to focus on spatially correlated patches across different views, enhancing local details while preserving cross-view consistency.

Abstract

Large-scale text-guided image diffusion models have shown astonishing results in text-to-image (T2I) generation. However, applying these models to synthesize textures for 3D geometries remains challenging due to the domain gap between 2D images and textures on a 3D surface. Early works that used a projecting-and-inpainting approach managed to preserve generation diversity but often produced noticeable artifacts and style inconsistencies. While recent methods have attempted to address these inconsistencies, they often introduce other issues, such as blurring, over-saturation, or over-smoothing. To overcome these challenges, we propose a novel text-to-texture synthesis framework that leverages pre-trained diffusion models. We first introduce a local attention reweighing mechanism in the self-attention layers to guide the model to concentrate on spatially correlated patches across different views, thereby enhancing local details while preserving cross-view consistency. Additionally, we propose a novel latent space merge pipeline, which further ensures consistency across different viewpoints without sacrificing too much diversity. Our method significantly outperforms existing state-of-the-art techniques in texture consistency and visual quality, while delivering results much faster than distillation-based methods. Importantly, our framework does not require additional training or fine-tuning, making it highly adaptable to a wide range of models available on public platforms.

Paper Structure

This paper contains 30 sections, 12 equations, 18 figures, 1 table, 1 algorithm.

Figures (18)

  • Figure 1: Given a mesh and a textual prompt, we aim to produce textures that depict the prompt well and suit the shape. To achieve this, we propose a local attention technique in Sec. \ref{sec:method-3d aware}, which enhances local details by reweighing the original self-attention layers based on the 3D shape. In addition, we introduce a framework for consistent texture synthesis in Sec. \ref{sec:method3.3}, which includes a latent merge pipeline and an efficient texture dilation algorithm, enabling the stable generation of consistent and high-quality textures.
  • Figure 2: A visualization of attention maps for the query patch (marked with a red star). The upper part illustrates the rendered position map and the calculated weight map. The bottom part shows the attention maps of different layers before and after being reweighed by the weight map. $B\{i\}T\{j\}$ stands for the $i$-th Block and $j$-th Transformer layer in the output layers.
  • Figure 3: Results of different attention mechanisms for 4-view diffusion with the prompt "A cute shiba inu dog". Images in row 1 are generated without cross-view attention and exhibit no consistency. Results using Global Attention (row 2) are consistent but lose color diversity and details. Images with Local Attention (rows 3-5) show improvements in diversity and details while maintaining a significant level of cross-view consistency. We find that setting $o=2$ achieves better diversity while eliminating artifacts with $o=8$.
  • Figure 4: Qualitative comparison with different baselines.
  • Figure 5: Ablation results on local attention and latent merge. The left three columns show the generated images, and the last column depicts the rendered result with synthesized texture.
  • ...and 13 more figures
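
The reweighing described in the Figure 2 caption (a weight map computed from the rendered position map, applied to an attention map) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration, not the paper's formulation: the Gaussian falloff on 3D distance, the `sigma` parameter, the choice to rescale after the softmax, and the function names `distance_weights` and `reweighed_attention` are all hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of attention scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def distance_weights(query_pos, key_positions, sigma=0.5):
    """Hypothetical weight map: Gaussian falloff on 3D surface distance.

    The kernel shape and sigma are assumptions, not taken from the paper.
    """
    return [
        math.exp(-math.dist(query_pos, p) ** 2 / (2 * sigma ** 2))
        for p in key_positions
    ]

def reweighed_attention(scores, query_pos, key_positions, sigma=0.5):
    """Scale one query's attention row by the weight map, then renormalize.

    scores: raw attention scores of the query against keys from all views.
    query_pos / key_positions: 3D positions read from a position map.
    """
    attn = softmax(scores)
    weights = distance_weights(query_pos, key_positions, sigma)
    scaled = [a * w for a, w in zip(attn, weights)]
    total = sum(scaled)
    return [s / total for s in scaled]

# One query patch attending to three patches in other views.
scores = [0.0, 0.0, 0.0]                  # equal raw scores
query_pos = (0.0, 0.0, 0.0)
key_positions = [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0), (2.0, 0.0, 0.0)]
out = reweighed_attention(scores, query_pos, key_positions)
```

With equal raw scores, plain softmax attention spreads evenly over the three keys, while the reweighed row concentrates on the keys that lie close to the query on the 3D surface and nearly ignores the distant one.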