TexSliders: Diffusion-Based Texture Editing in CLIP Space
Julia Guerrero-Viu, Milos Hasan, Arthur Roullier, Midhun Harikumar, Yiwei Hu, Paul Guerrero, Diego Gutierrez, Belen Masia, Valentin Deschaintre
TL;DR
TexSliders edits textures with diffusion models by manipulating CLIP image embeddings instead of cross-attention maps, guided by a texture-domain prior. It defines semantic editing directions from pairs of simple text prompts, derives robust, identity-preserving sliders from per-dimension statistics, and applies each edit through a diffusion model conditioned on texture embeddings. The approach yields tileable textures without retraining or ground-truth data and supports composing multiple sliders; the reported qualitative and quantitative results surpass general-purpose diffusion-editing methods on textures. This enables intuitive, zero-shot texture manipulation suitable for 3D pipelines and design, and suggests that image-embedding conditioning combined with a domain prior makes texture editing reliable.
Abstract
Generative models have enabled intuitive image creation and manipulation using natural language. In particular, diffusion models have recently shown remarkable results for natural image editing. In this work, we propose to apply diffusion techniques to edit textures, a specific class of images that are an essential part of 3D content creation pipelines. We analyze existing editing methods and show that they are not directly applicable to textures, since their common underlying approach, manipulating attention maps, is unsuitable for the texture domain. To address this, we propose a novel approach that instead manipulates CLIP image embeddings to condition the diffusion generation. We define editing directions using simple text prompts (e.g., "aged wood" to "new wood") and map these to CLIP image embedding space using a texture prior, with a sampling-based approach that gives us identity-preserving directions in CLIP space. To further improve identity preservation, we project these directions to a CLIP subspace that minimizes identity variations resulting from entangled texture attributes. Our editing pipeline facilitates the creation of arbitrary sliders using natural language prompts only, with no ground-truth annotated data necessary.
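
To make the slider idea concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that a texture prior has already produced sampled image embeddings for the source prompt and for the target prompt. The helper `fake_prior_samples` is a hypothetical stand-in for that prior, and the selection rule (keep the dimensions whose mean shift is largest relative to their spread) is an illustrative assumption about how per-dimension statistics could restrict a direction to a subspace. The names `slider_direction`, `apply_slider`, and the `keep` parameter are likewise invented for this example.

```python
import numpy as np


def fake_prior_samples(rng, center, n=64, noise=0.05):
    """Hypothetical stand-in for a texture prior: returns n noisy
    'image embeddings' scattered around a given center vector."""
    center = np.asarray(center, dtype=np.float64)
    return rng.normal(loc=center, scale=noise, size=(n, center.size))


def slider_direction(src_samples, dst_samples, keep=0.1):
    """Estimate an editing direction from two sets of sampled embeddings.

    The direction is the difference of the per-dimension means. Only the
    fraction `keep` of dimensions with the largest shift relative to their
    pooled standard deviation is retained; all others are zeroed so the
    edit leaves them unchanged (assumed selection rule, for illustration).
    """
    src = np.asarray(src_samples, dtype=np.float64)
    dst = np.asarray(dst_samples, dtype=np.float64)
    diff = dst.mean(axis=0) - src.mean(axis=0)
    spread = np.sqrt(0.5 * (src.var(axis=0) + dst.var(axis=0))) + 1e-8
    score = np.abs(diff) / spread
    k = max(1, int(round(keep * diff.size)))
    top = np.argsort(score)[-k:]  # indices of the k highest-scoring dims
    mask = np.zeros_like(diff)
    mask[top] = 1.0
    return diff * mask


def apply_slider(embedding, direction, strength):
    """Move an embedding along the direction; strength is the slider value."""
    return np.asarray(embedding, dtype=np.float64) + strength * direction


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim = 16
    source_center = np.zeros(dim)
    target_center = source_center.copy()
    target_center[:2] = 1.0  # the two prompts differ only in dims 0 and 1

    src = fake_prior_samples(rng, source_center)
    dst = fake_prior_samples(rng, target_center)
    direction = slider_direction(src, dst, keep=2 / dim)

    texture_embedding = rng.normal(size=dim)
    edited = apply_slider(texture_embedding, direction, strength=0.5)
    print("edited dimensions:", np.flatnonzero(edited != texture_embedding))
```

In a full pipeline the edited embedding would condition the diffusion model; that step is omitted here. Under this sketch, composing several sliders would amount to adding their scaled directions to the same embedding.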
