CasTex: Cascaded Text-to-Texture Synthesis via Explicit Texture Maps and Physically-Based Shading
Mishan Aliev, Dmitry Baranchuk, Kirill Struminsky
TL;DR
CasTex tackles zero-shot texture synthesis for 3D assets by using cascaded pixel-space diffusion models with an explicit texture parameterization to produce physically-based textures that render realistically under varied lighting. The method analyzes the behavior of score distillation sampling (SDS), showing that latent-space guidance can yield out-of-domain artifacts, and adopts a two-stage refinement (coarse texture generation followed by super-resolution) within an end-to-end differentiable shading framework based on the Disney BRDF. On Objaverse, CasTex achieves state-of-the-art results compared to optimization-based baselines; the two-stage pipeline gives the best FID/KID and perceptual quality while remaining computationally practical. The work highlights the importance of explicit parameterization, super-resolution refinement, and lighting disentanglement, and suggests future work to further mitigate baked-in lighting artifacts and the limitations of latent-space SDS.
Abstract
This work investigates text-to-texture synthesis using diffusion models to generate physically-based texture maps. We aim to achieve realistic model appearances under varying lighting conditions. A prominent solution for the task is score distillation sampling, which recovers a complex texture using gradient guidance given a differentiable rasterization and shading pipeline. In practice, however, combining score distillation sampling with the widespread latent diffusion models produces severe visual artifacts and requires additional regularization such as implicit texture parameterization. As a more direct alternative, we propose an approach using cascaded diffusion models for texture synthesis (CasTex). In our setup, score distillation sampling yields high-quality textures out of the box. In particular, we were able to replace the implicit texture parameterization with an explicit one, which improves the procedure. In our experiments, our approach significantly outperforms state-of-the-art optimization-based solutions on public texture synthesis benchmarks.
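To make the idea of score distillation sampling on an explicit texture concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes the standard SDS update, in which the gradient passed back to the rendered image is w(t) * (predicted noise - injected noise). Everything named here is hypothetical: `render` is an identity placeholder for differentiable rasterization and shading, `toy_denoiser` stands in for a pretrained pixel-space diffusion model by returning the exact noise prediction for a single known `target` image, and the weighting w(t) = 1 - alpha_bar is one common choice rather than the paper's.

```python
import math
import random

def toy_denoiser(x_t, alpha_bar, target):
    """Hypothetical stand-in for a pixel-space diffusion model.

    Returns the exact noise prediction for a data distribution
    concentrated on the single image `target`.
    """
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - s * t) / n for x, t in zip(x_t, target)]

def render(texture):
    """Placeholder for differentiable rasterization and shading (identity)."""
    return list(texture)

def sds_step(texture, target, rng, lr=0.1):
    """One score distillation update applied directly to explicit texels."""
    alpha_bar = rng.uniform(0.05, 0.95)  # random noise level for this step
    s, n = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    image = render(texture)
    eps = [rng.gauss(0.0, 1.0) for _ in image]           # injected noise
    x_t = [s * x + n * e for x, e in zip(image, eps)]    # noised render
    eps_hat = toy_denoiser(x_t, alpha_bar, target)
    weight = 1.0 - alpha_bar  # assumed w(t)
    # SDS gradient w(t) * (eps_hat - eps); the render Jacobian is the
    # identity here, so it maps one-to-one onto the texels.
    grad = [weight * (eh - e) for eh, e in zip(eps_hat, eps)]
    return [x - lr * g for x, g in zip(texture, grad)]

def optimize(target, steps=500, seed=0):
    """Run SDS from a flat gray texture toward the denoiser's target."""
    rng = random.Random(seed)
    texture = [0.5] * len(target)
    for _ in range(steps):
        texture = sds_step(texture, target, rng)
    return texture
```

Calling `optimize([0.2, 0.8, 0.4, 0.6])` drives the four texels toward the target values, because the injected noise cancels exactly in this toy setting. With a real diffusion model the prediction is only approximate and depends on the text prompt, and the gradient would be backpropagated through the rasterizer and the shading model to the texture maps.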
