CasTex: Cascaded Text-to-Texture Synthesis via Explicit Texture Maps and Physically-Based Shading

Mishan Aliev, Dmitry Baranchuk, Kirill Struminsky

TL;DR

CasTex tackles zero-shot texture synthesis for 3D assets by using cascaded pixel-space diffusion with explicit texture parameterization to produce physically-based textures that render realistically under varied lighting. The method analyzes the behavior of score distillation sampling (SDS), showing that latent-space guidance can yield out-of-domain artifacts, and adopts a two-stage procedure (coarse texture generation followed by super-resolution refinement) within an end-to-end differentiable shading framework based on the Disney BRDF. On Objaverse, CasTex achieves state-of-the-art results compared to optimization-based baselines, with the two-stage pipeline offering the best FID/KID and perceptual quality, while remaining computationally practical. The work highlights the importance of explicit parameterization, super-resolution (SR) refinement, and lighting disentanglement, and suggests future work to further mitigate baked-in lighting artifacts and latent-SDS limitations.

Abstract

This work investigates text-to-texture synthesis using diffusion models to generate physically-based texture maps. We aim to achieve realistic model appearances under varying lighting conditions. A prominent solution for the task is score distillation sampling (SDS). It allows recovering a complex texture using gradient guidance, given a differentiable rasterization and shading pipeline. However, in practice, combining SDS with the widespread latent diffusion models produces severe visual artifacts and requires additional regularization, such as implicit texture parameterization. As a more direct alternative, we propose an approach using cascaded diffusion models for texture synthesis (CasTex). In our setup, score distillation sampling yields high-quality textures out of the box. In particular, we were able to omit implicit texture parameterization in favor of an explicit parameterization to improve the procedure. In experiments, we show that our approach significantly outperforms state-of-the-art optimization-based solutions on public texture synthesis benchmarks.
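
For reference, the gradient guidance mentioned in the abstract is usually written in the following form, taken from the general score distillation literature. The notation here is an assumption and may differ from the paper's own SDS equation: $g(\theta)$ is the differentiable rendering of the texture $\theta$, $y$ is the text prompt, $\hat{\epsilon}_\phi$ is the diffusion model's noise prediction, $w(t)$ is a timestep weight, and $\alpha_t, \sigma_t$ define the noise schedule.

$$
\nabla_\theta \mathcal{L}_{\text{SDS}}(\theta) = \mathbb{E}_{t,\epsilon}\left[ w(t)\,\bigl(\hat{\epsilon}_\phi(x_t; y, t) - \epsilon\bigr)\,\frac{\partial g(\theta)}{\partial \theta} \right], \qquad x_t = \alpha_t\, g(\theta) + \sigma_t\, \epsilon, \quad \epsilon \sim \mathcal{N}(0, I).
$$

With an explicit texture parameterization, $\theta$ is the texture map itself, so the gradient reaches the texels directly through the renderer rather than through an auxiliary network.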

Paper Structure

This paper contains 19 sections, 5 equations, 14 figures, 5 tables.

Figures (14)

  • Figure 1: Overall pipeline. Our method consists of two stages. In the first stage, given a 3D mesh and a text prompt, we use a differentiable rendering pipeline to generate random views of the model under varying lighting conditions. The texture, initialized randomly, is optimized using the SDS gradient $\nabla_\theta \mathcal{L}_{\text{SDS}}(\theta)$ (Eq. \ref{eq:sds}). In the second stage, we refine the texture produced in stage one using SDS with a super-resolution diffusion model. For each camera view and lighting setup, we render two images: one using the fixed texture from the first stage, and one using the current texture being optimized. We denote the SDS gradient adapted to the super-resolution model as $\nabla_\theta \mathcal{L}_{\text{SR-SDS}}$. Using the former image as a conditioning input, we backpropagate the SR-SDS gradients through the latter image.
  • Figure 2: Score distillation sampling applied to a super-resolution diffusion model. The horizontal axis indicates optimization progress over time. Top row: the same image $g(\theta)$ is used both as a diffusion-model input $x$ and as the conditioning image $x_{cond}$ for super-resolution. This setup is unstable and causes the optimization to drift away from the original image. Bottom row: the same procedure, but with fixed super-resolution condition $x_{cond} = g(\theta_0)$. The fixed condition anchors the optimization dynamics, preserving the structure of the initial image.
  • Figure 3: Optimization-based image generation for a toy single-mode diffusion model. We study the effect of image parameterization and the difference between pixel- and latent-space diffusion models. Score Distillation Sampling (SDS) finds the distribution mode for both latent- and pixel-space diffusion models. Top row: the pixel-space model achieves near-perfect reconstruction quality, as measured by PSNR. The latent code $\text{enc}(\theta_{\text{RGB}})$ is also close to the mode and leads to a high-quality reconstruction. However, for the latent model, the optimized parameter $\theta_{\text{RGB}}$ itself fails to approximate the original image. Bottom row: the implicit regularization proposed in [youwang2024paint] significantly improves the SDS results for the latent diffusion model, but still falls short of the image reconstruction given by the decoder or the pixel-space diffusion model.
  • Figure 4: Qualitative comparison. We synthesized textures for models from the Objaverse dataset using the proposed approach and several recent competing methods. Our method generates seamless textures with softer colors compared with latent diffusion-based approaches.
  • Figure 5: Human preference study comparing our method with competing optimization-based and back-projection baselines.
  • ...and 9 more figures
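
The pixel-space half of the toy experiment described in Figure 3 can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' setup: the single-mode data distribution is taken to be an isotropic Gaussian around a target image (so the ideal noise prediction has a closed form), the noise schedule is a cosine schedule, the SDS weight is $w(t) = 1$, and the names `toy_eps_pred`, `sds_optimize`, and `psnr` are hypothetical. The image is parameterized explicitly, so the Jacobian of the "render" with respect to $\theta$ is the identity.

```python
import numpy as np


def toy_eps_pred(x_t, alpha_t, sigma_t, mode, data_std):
    """Closed-form noise prediction for data ~ N(mode, data_std^2 I).

    Assumption: this Gaussian stands in for a single-mode diffusion model.
    The noisy marginal is N(alpha_t * mode, (alpha_t^2 data_std^2 + sigma_t^2) I),
    and eps_hat = -sigma_t * score(x_t).
    """
    var_t = (alpha_t * data_std) ** 2 + sigma_t ** 2
    return sigma_t * (x_t - alpha_t * mode) / var_t


def sds_optimize(mode, steps=2000, lr=0.01, data_std=0.1, seed=0):
    """Run SDS on an explicit pixel parameterization theta (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 1.0, size=mode.shape)  # random initialization
    for _ in range(steps):
        t = rng.uniform(0.02, 0.98)  # random diffusion time
        # Assumed cosine schedule with alpha_t^2 + sigma_t^2 = 1.
        alpha_t = np.cos(0.5 * np.pi * t)
        sigma_t = np.sin(0.5 * np.pi * t)
        eps = rng.standard_normal(mode.shape)
        x_t = alpha_t * theta + sigma_t * eps  # noised "render" of theta
        # SDS gradient with w(t) = 1; d(render)/d(theta) is the identity here.
        grad = toy_eps_pred(x_t, alpha_t, sigma_t, mode, data_std) - eps
        theta -= lr * grad
    return theta


def psnr(x, ref, peak=1.0):
    """Peak signal-to-noise ratio in dB for values in [0, peak]."""
    mse = np.mean((x - ref) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)


if __name__ == "__main__":
    target = np.linspace(0.2, 0.8, 64)  # stand-in for a flattened image
    result = sds_optimize(target)
    print("PSNR (dB):", psnr(result, target))
```

In expectation over the noise, the gradient in this sketch is proportional to $\theta$ minus the mode, which is why plain gradient descent settles near the mode. The sketch does not model the latent-space case, where the encoder sits between $\theta_{\text{RGB}}$ and the diffusion model.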