Text-Guided Texturing by Synchronized Multi-View Diffusion
Yuxin Liu, Minshan Xie, Hanyuan Liu, Tien-Tsin Wong
TL;DR
The paper tackles text-guided texturing of 3D objects with rich detail and without additional training data. It introduces Synchronized Multi-View Diffusion (MVD), which shares latent texture information across overlapping UV regions at every denoising step to align views and prevent seams. The approach combines UV-space fusion, cosine-based weighting, and self-attention reuse so that the views reach a consensus early while fine details are preserved. It outperforms state-of-the-art baselines on FID, CLIP, and 3D-consistency metrics, as well as in user studies. Limitations include lighting baked into textures, bottom-view bias from priors, and imperfect depth-boundary handling; these suggest optimization-based boundary methods and ways to avoid baked-in lighting as future directions. Overall, the method provides a practical, zero-shot framework for coherent, detailed texturing of arbitrary meshes using pre-trained diffusion models.
Abstract
This paper introduces a novel approach to synthesizing texture to dress up a given 3D object from a text prompt. Building on pretrained text-to-image (T2I) diffusion models, existing methods usually employ a project-and-inpaint approach, in which one view of the object is generated first and then warped to another view for inpainting. However, this approach tends to generate inconsistent texture because the multiple views are diffused asynchronously. We believe such asynchronous diffusion and insufficient information sharing among views are the root causes of the inconsistency artifacts. In this paper, we propose a synchronized multi-view diffusion approach that allows the diffusion processes of different views to reach a consensus on the generated content early in the process, and hence ensures texture consistency. To synchronize the diffusion, we share the denoised content among views at each denoising step, specifically by blending the latent content of overlapping views in the texture domain. Our method demonstrates superior performance in generating consistent, seamless, and highly detailed textures compared to state-of-the-art methods.
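The blending step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each view's latent has already been projected onto texels, that every texel holds a single scalar latent value, and that a view's weight for a texel is the clamped cosine between the surface normal and the direction to the camera. The function names (`cosine_weight`, `blend_texels`, `synchronize`) and the dictionary layout are hypothetical and chosen only for illustration.

```python
import math


def cosine_weight(normal, view_dir):
    """Clamped cosine between a surface normal and the direction to the camera.

    Assumption: this stands in for the cosine-based weighting mentioned in
    the summary; the exact form used in the paper may differ.
    """
    dot = sum(n * v for n, v in zip(normal, view_dir))
    norm = math.sqrt(sum(n * n for n in normal)) * math.sqrt(
        sum(v * v for v in view_dir)
    )
    if norm == 0.0:
        return 0.0
    return max(0.0, dot / norm)


def blend_texels(view_latents, view_weights):
    """Weighted average of per-view latent values for each texel.

    view_latents: one dict per view, mapping texel id -> latent value.
    view_weights: one dict per view, mapping texel id -> non-negative weight.
    Returns a dict mapping texel id -> blended latent value.
    """
    numerator, denominator = {}, {}
    for latents, weights in zip(view_latents, view_weights):
        for texel, value in latents.items():
            w = weights.get(texel, 0.0)
            if w <= 0.0:
                continue  # this view does not usefully observe the texel
            numerator[texel] = numerator.get(texel, 0.0) + w * value
            denominator[texel] = denominator.get(texel, 0.0) + w
    return {t: numerator[t] / denominator[t] for t in numerator}


def synchronize(view_latents, view_weights):
    """Replace each view's texel values with the shared blended values.

    Texels with no positive weight in any view keep their original value.
    """
    blended = blend_texels(view_latents, view_weights)
    return [
        {t: blended.get(t, v) for t, v in latents.items()}
        for latents in view_latents
    ]
```

In a full pipeline, a step like `synchronize` would run once per denoising step, between projecting each view's denoised latent into UV space and rendering the shared texture back into the views. Those projection and rendering steps are omitted here.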
