TexSliders: Diffusion-Based Texture Editing in CLIP Space

Julia Guerrero-Viu, Milos Hasan, Arthur Roullier, Midhun Harikumar, Yiwei Hu, Paul Guerrero, Diego Gutierrez, Belen Masia, Valentin Deschaintre

TL;DR

TexSliders addresses the challenge of texture editing with diffusion models by shifting editing from cross-attention to the CLIP image-embedding space, guided by a texture-domain prior. It defines semantic editing directions from pairs of plain prompts, computes robust, identity-preserving sliders via per-dimension statistics, and applies edits through a diffusion model conditioned on texture embeddings. The approach yields tileable textures without re-training or ground-truth data and supports composing multiple sliders, with strong qualitative and quantitative results that surpass general-purpose diffusion-editing methods on textures. This work enables intuitive, zero-shot texture manipulation suitable for 3D pipelines and design, offering practical evidence that image-embedding conditioning and domain priors can unlock reliable texture editing at scale.

Abstract

Generative models have enabled intuitive image creation and manipulation using natural language. In particular, diffusion models have recently shown remarkable results for natural image editing. In this work, we propose to apply diffusion techniques to edit textures, a specific class of images that are an essential part of 3D content creation pipelines. We analyze existing editing methods and show that they are not directly applicable to textures, since their common underlying approach, manipulating attention maps, is unsuitable for the texture domain. To address this, we propose a novel approach that instead manipulates CLIP image embeddings to condition the diffusion generation. We define editing directions using simple text prompts (e.g., "aged wood" to "new wood") and map these to CLIP image embedding space using a texture prior, with a sampling-based approach that gives us identity-preserving directions in CLIP space. To further improve identity preservation, we project these directions to a CLIP subspace that minimizes identity variations resulting from entangled texture attributes. Our editing pipeline facilitates the creation of arbitrary sliders using natural language prompts only, with no ground-truth annotated data necessary.
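
The pipeline described in the abstract can be written compactly. The block below is one plausible formalization and not a reproduction of the paper's own equations. The symbols $t_{\mathrm{src}}$, $t_{\mathrm{tgt}}$ (the two prompts), $\mathcal{P}_i$ (the $i$-th sample drawn from the texture prior), $n_e$ (number of samples per prompt), $S$ (the retained dimensions) and $\alpha$ (the slider value) are notation assumed here for illustration, as is the linear update in the last line.

```latex
% Assumed notation, for illustration only (not the paper's equations).
% t_src, t_tgt: source and target prompts; P_i: i-th sample from the prior;
% S: index set of the n dimensions kept after the subspace projection.
\begin{align}
\mathbf{d'} &= \frac{1}{n_e}\sum_{i=1}^{n_e} \mathcal{P}_i(t_{\mathrm{tgt}})
             - \frac{1}{n_e}\sum_{i=1}^{n_e} \mathcal{P}_i(t_{\mathrm{src}}), \\
d_j &= \begin{cases} d'_j & \text{if } j \in S, \\ 0 & \text{otherwise,} \end{cases}
      \qquad |S| = n, \\
e(\alpha) &= e_0 + \alpha\,\mathbf{d}.
\end{align}
```

Under these assumptions, $\alpha = 0$ returns the input embedding $e_0$ unchanged, and negative values of $\alpha$ move in the opposite direction (e.g., from "new wood" toward "aged wood").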

Paper Structure (23 sections, 3 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: Visualization of cross-attention maps. We show maps at the last diffusion step of SD 1.4 [rombach2021stablediff], given two different input prompts. Top: "a cute panda eating pizza" (non-texture). Bottom: "a texture of small stones" (texture). The attention maps contain meaningful semantic information for the panda image but fail to capture the structure of the texture.
  • Figure 2: Text vs. image conditioning in diffusion models. Images generated by a diffusion model conditioned on two text prompts (top rows) and on images (bottom rows). In both cases, each column represents a different seed when sampling the diffusion model. Text conditioning, even with a specific prompt (second row), maps to a larger region in appearance space and can thus result in many different visual identities, while image conditioning maps the result to a more specific appearance. Text conditioning is done with SD 1.4 [rombach2021stablediff], and image conditioning is done with the Latent Diffusion Model of [aggarwal2023Backdrop].
  • Figure 3: Overview of our diffusion-based texture editing approach. Top row: Our approach leverages a diffusion prior model $\mathcal{P}$ to convert text embeddings to image embeddings, enabling the use of an image-conditioned pre-trained diffusion model $\mathcal{D}$. Bottom row: To perform the desired edit, we first compute the direction $\mathbf{d'}\in \mathbb{R}^{768}$ (red arrow) as the difference between the centroids (two small red crosses) of the clusters formed by the image embeddings of the two prompts that define the edit (e.g., "metal" to "rusty metal"). Naively applying this direction to a specific texture $e_0$ (highlighted in blue) leads to significant identity variations as we move along it (left). Instead, we select a subset of $n$ relevant dimensions ($n < 768$) that contribute to the desired edit, leading to our final editing direction $\mathbf{d}$ (green arrow), which preserves the identity of the input texture (right). The high-dimensional CLIP image embedding space is shown in 2D for visualization purposes.
  • Figure 4: Qualitative results. We show our method on different kinds of materials for various editing directions. Our method applies convincing edits, including when extrapolating the directions (two leftmost columns and rightmost column), and preserves the texture identity well. Here we use generated input textures; Figure \ref{fig:real_images} and the Supplemental Material also provide results on photographed inputs.
  • Figure 5: Ablation study. We show qualitative ablations for the direction "metal" to "rusty metal" (positive and negative). We compare against using a single image embedding for the original and target prompts ($n_e = 1$, top row) and against using the direction $\mathbf{d'}$, computed from multiple image embeddings ($n_e = 150$) but keeping all 768 dimensions of the direction (middle row). Both options show a larger identity shift than our approach (bottom row), which successfully edits the texture while maintaining its original structure.
  • ...and 4 more figures
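
The direction computation summarized in the Figure 3 caption can be sketched numerically. This is a minimal illustration on synthetic data, not the authors' implementation: the random arrays stand in for embeddings that would come from the prior $\mathcal{P}$, and the relevance score used to pick dimensions (centroid shift divided by within-cluster spread) is an assumption, since the captions do not state the selection criterion. The function names `edit_direction` and `apply_slider` are hypothetical.

```python
import numpy as np


def edit_direction(src_emb, tgt_emb, n_keep):
    """Return (d_full, d): the centroid-difference direction and its
    restriction to n_keep dimensions.

    src_emb, tgt_emb: arrays of shape (n_e, dim) holding image embeddings
    sampled for the source and target prompts.
    """
    # d': difference between the two cluster centroids.
    d_full = tgt_emb.mean(axis=0) - src_emb.mean(axis=0)
    # ASSUMPTION: score each dimension by its centroid shift relative to
    # the average within-cluster standard deviation.
    spread = 0.5 * (src_emb.std(axis=0) + tgt_emb.std(axis=0)) + 1e-8
    score = np.abs(d_full) / spread
    # Keep the n_keep highest-scoring dimensions and zero out the rest.
    keep = np.argsort(score)[-n_keep:]
    d = np.zeros_like(d_full)
    d[keep] = d_full[keep]
    return d_full, d


def apply_slider(e0, d, alpha):
    """Move the embedding e0 along direction d by slider value alpha."""
    return e0 + alpha * d


# Synthetic stand-ins for prior samples (hypothetical data).
rng = np.random.default_rng(0)
dim, n_e, n_keep = 768, 150, 32
src = rng.normal(0.0, 1.0, size=(n_e, dim))
tgt = rng.normal(0.0, 1.0, size=(n_e, dim))
tgt[:, :n_keep] += 5.0  # only the first n_keep dimensions carry the edit

d_full, d = edit_direction(src, tgt, n_keep)
e0 = rng.normal(size=dim)  # embedding of the texture to edit
edited = apply_slider(e0, d, alpha=1.0)
```

In this toy setup the unmasked direction `d_full` perturbs every dimension of `e0` through sampling noise, while the masked direction `d` changes only the dimensions that separate the two clusters. The edited embedding would then be passed to the image-conditioned diffusion model $\mathcal{D}$, which is outside the scope of this sketch.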