
StyleTex: Style Image-Guided Texture Generation for 3D Models

Zhiyu Xie, Yuqing Zhang, Xiangjun Tang, Yiqian Wu, Dehan Chen, Gongsheng Li, Xiaogang Jin

TL;DR

A novel approach to disentangling the reference image's style and content information lets StyleTex generate textures that retain the style of the reference image while also aligning with the text prompts and intrinsic details of the given 3D mesh.

Abstract

Style-guided texture generation aims to generate a texture that is harmonious with both the style of the reference image and the geometry of the input mesh, given a reference style image and a 3D mesh with its text description. Although diffusion-based 3D texture generation methods, such as distillation sampling, have numerous promising applications in stylized games and films, using them for style-guided texturing requires addressing two challenges: 1) completely decoupling style from content in the reference image for 3D models, and 2) aligning the generated texture with the color tone and style of the reference image as well as with the given text prompt. To this end, we introduce StyleTex, an innovative diffusion-model-based framework for creating stylized textures for 3D models. Our key insight is to extract style information from the reference image while disregarding its content in diffusion-based distillation sampling. Specifically, given a reference image, we first decompose its style feature from the image CLIP embedding by subtracting the embedding's orthogonal projection onto the direction of the content feature, which is represented by a text CLIP embedding. This approach to disentangling the reference image's style and content information yields distinct style and content features. We then inject the style feature into the cross-attention mechanism to incorporate it into the generation process, while using the content feature as a negative prompt to further dissociate content information. Finally, we incorporate these strategies into StyleTex to obtain stylized textures. The resulting textures retain the style of the reference image while also aligning with the text prompts and intrinsic details of the given 3D mesh. Quantitative and qualitative experiments show that our method outperforms existing baseline methods by a significant margin.
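
The style-content decomposition described in the abstract amounts to $f_s = e_I - \frac{\langle e_I, e_c \rangle}{\lVert e_c \rVert^2} e_c$, where $e_I$ is the image CLIP embedding and $e_c$ is the text CLIP embedding of the content. The following is a minimal NumPy sketch of that step, assuming both embeddings are already available as vectors of the same dimension. The function name `decompose_style`, the dimension 768, and the random stand-in embeddings are hypothetical, and whether StyleTex normalizes the embeddings before projecting is not stated here.

```python
import numpy as np


def decompose_style(image_emb: np.ndarray, content_emb: np.ndarray) -> np.ndarray:
    """Return the image embedding with its projection onto content_emb removed.

    image_emb:   CLIP image embedding of the reference image, shape (d,).
    content_emb: CLIP text embedding describing the reference image's content, shape (d,).
    (Hypothetical helper; not taken from the StyleTex code.)
    """
    norm_sq = float(np.dot(content_emb, content_emb))
    if norm_sq == 0.0:
        raise ValueError("content embedding must be non-zero")
    # Orthogonal projection of image_emb onto the content direction.
    projection = (np.dot(image_emb, content_emb) / norm_sq) * content_emb
    # The remainder is orthogonal to the content direction and serves as the style feature.
    return image_emb - projection


# Usage with random stand-ins for real CLIP embeddings (d = 768 is an assumption).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=768)
content_emb = rng.normal(size=768)
style_emb = decompose_style(image_emb, content_emb)
print(abs(np.dot(style_emb, content_emb)) < 1e-8)  # True
```

Only the single direction identified as content is removed; every other component of the image embedding is left intact, so the style feature keeps as much of the reference image's appearance as possible.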

Paper Structure

This paper contains 31 sections, 12 equations, 14 figures, 2 tables.

Figures (14)

  • Figure 1: Overview of our pipeline. StyleTex takes as input a reference style image $I_{ref}$, a text prompt $y$, and an untextured 3D mesh $\mathcal{M}$. During training, we use our ODCR method (described in Sec. \ref{sec: ism}) to extract a content-independent style feature, $f_s^{ref}$, from the reference image. The style feature and text embeddings are fed into the UNet to guide the optimization of the texture field. During inference, texture maps can be sampled from the texture field and used directly in downstream game or film production, enabling the creation of stylized digital environments.
  • Figure 2: Ablation study on style guidance. (a) Baseline for text-to-texture. (b) Use "in xxx style" text prompts for style guidance. (c) Add the whole image prompt as guidance. (d) Add our style guidance strategy. (e) Add content embedding of the reference image as a negative prompt. (f) Full model with the style guidance strategy and content embedding of the reference image as a negative prompt.
  • Figure 3: Stylized texture results obtained using various transformer-layer style injection strategies. The prompts are "a cupcake in ice and snow covered style" and "a wooden treasure chest with metal accents and locks in colorful drawing style".
  • Figure 4: Results using our style-content decoupling method with SDS loss (a) and ISM loss (b) for the prompts "a strawberry/teapot in colorful graffiti style" and "a strawberry/teapot in Chinese ink paint style".
  • Figure 5: Stylized texture results achieved using different content removal strategies in CLIP space. The prompts are "a hand bag in watercolor sketch style" and "a pot in a colorful painting style".
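
Using the reference image's content embedding as a negative prompt (Figure 2 (e) and (f)) can be illustrated with the standard classifier-free guidance formulation, in which the noise prediction for the negative prompt takes the place of the unconditional prediction. This is a hedged sketch, not StyleTex's published code: the function names, the guidance scale of 7.5, and the latent shape are assumptions, and random arrays stand in for the UNet's noise predictions. The paper compares SDS and ISM losses (Figure 4); only the plain SDS residual is shown here, and the ISM target differs.

```python
import numpy as np


def negative_prompt_cfg(eps_cond: np.ndarray, eps_neg: np.ndarray, guidance_scale: float) -> np.ndarray:
    """Classifier-free guidance with a negative prompt (hypothetical helper).

    eps_cond: noise prediction conditioned on the text prompt (plus, in StyleTex,
              the style feature injected through cross-attention).
    eps_neg:  noise prediction conditioned on the negative prompt, assumed here to be
              the reference image's content embedding.
    """
    # Extrapolate away from the negative-prompt prediction toward the conditional one.
    return eps_neg + guidance_scale * (eps_cond - eps_neg)


def sds_residual(eps_guided: np.ndarray, eps_true: np.ndarray, weight: float = 1.0) -> np.ndarray:
    """Plain SDS residual w(t) * (eps_hat - eps), later back-propagated through the renderer."""
    return weight * (eps_guided - eps_true)


# Random stand-ins for UNet outputs; the (4, 64, 64) latent shape is hypothetical.
rng = np.random.default_rng(0)
shape = (4, 64, 64)
eps_cond = rng.normal(size=shape)
eps_neg = rng.normal(size=shape)
eps_true = rng.normal(size=shape)  # noise actually added to the rendered latent

eps_hat = negative_prompt_cfg(eps_cond, eps_neg, guidance_scale=7.5)
grad = sds_residual(eps_hat, eps_true)
print(grad.shape)  # (4, 64, 64)
```

With `guidance_scale = 1` the guided prediction reduces to `eps_cond`; larger values push the update further away from whatever the negative prompt describes, which is how a content embedding used as a negative prompt discourages the reference image's content from leaking into the texture.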