Text-Guided Texturing by Synchronized Multi-View Diffusion

Yuxin Liu, Minshan Xie, Hanyuan Liu, Tien-Tsin Wong

TL;DR

The paper addresses text-guided texturing of 3D objects without additional training data. It introduces Synchronized Multi-View Diffusion (MVD), which shares latent texture information across overlapping UV regions at every denoising step to align views and prevent seams. The approach combines UV-space fusion, cosine-based weighting, and self-attention reuse so that the views reach a consensus early while fine details are preserved, outperforming state-of-the-art baselines on FID, CLIP, and 3D-consistency metrics, as well as in user studies. Limitations include lighting baked into textures, a bottom-view bias inherited from the diffusion prior, and imperfect handling of depth boundaries, which points to future work on optimization-based boundary handling and on avoiding baked-in lighting. Overall, the method provides a practical, zero-shot framework for coherent, detailed texturing of arbitrary meshes using pre-trained diffusion models.

Abstract

This paper introduces a novel approach to synthesizing texture for a given 3D object from a text prompt. Building on pretrained text-to-image (T2I) diffusion models, existing methods usually employ a project-and-inpaint approach, in which a view of the given object is first generated and then warped to another view for inpainting. However, this approach tends to generate inconsistent texture because the multiple views are diffused asynchronously. We believe that such asynchronous diffusion and insufficient information sharing among views are the root causes of the inconsistency artifacts. In this paper, we propose a synchronized multi-view diffusion approach that allows the diffusion processes of different views to reach a consensus on the generated content early in the process, and hence ensures texture consistency. To synchronize the diffusion, we share the denoised content among different views in each denoising step, specifically by blending, in the texture domain, the latent content from views that overlap. Our method demonstrates superior performance in generating consistent, seamless, highly detailed textures compared to state-of-the-art methods.
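The blending of overlapping views in the texture domain can be illustrated with a small sketch. This is a toy under stated assumptions, not the authors' implementation: textures are flat lists of scalar texels, the function names `cosine_weight` and `blend_partial_textures` are hypothetical, and the weight is assumed to be the clamped cosine between the surface normal and the direction toward the camera (the summary above only says "cosine-based weighting").

```python
def cosine_weight(normal, view_dir):
    """Clamped cosine between a unit surface normal and a unit vector
    pointing from the surface toward the camera (assumed weighting)."""
    dot = sum(n * v for n, v in zip(normal, view_dir))
    return max(dot, 0.0)


def blend_partial_textures(partials, weights, masks):
    """Blend per-view partial textures into one texture.

    partials: per-view lists of texel values
    weights:  per-view lists of blending weights (e.g. from cosine_weight)
    masks:    per-view lists of booleans, True where the view covers a texel
    Returns (texture, covered); texels seen by no view stay at 0.0.
    """
    n_texels = len(partials[0])
    texture = [0.0] * n_texels
    covered = [False] * n_texels
    for i in range(n_texels):
        num = 0.0
        den = 0.0
        for tex, w, m in zip(partials, weights, masks):
            if m[i] and w[i] > 0.0:
                num += w[i] * tex[i]
                den += w[i]
        if den > 0.0:
            # Normalized weighted average over the views that see texel i.
            texture[i] = num / den
            covered[i] = True
    return texture, covered


# Two views of a 3-texel strip; texel 1 is the overlapping region.
view_a, mask_a, w_a = [1.0, 1.0, 0.0], [True, True, False], [1.0, 0.5, 0.0]
view_b, mask_b, w_b = [0.0, 3.0, 3.0], [False, True, True], [0.0, 0.5, 1.0]
texture, covered = blend_partial_textures(
    [view_a, view_b], [w_a, w_b], [mask_a, mask_b]
)
```

In the overlap, the two views disagree (1.0 versus 3.0) and the blend settles on a single shared value, which is the consensus that both views would then continue denoising from.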

Paper Structure (23 sections, 3 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: Left: An illustration of information exchange in the overlapped region (pink region) at intermediate steps of diffusion. Without information exchange, denoising results in different views of the same object could diverge into different directions, leading to seams when projecting to an output texture. Right: To address this issue, we propose a Multi-View Diffusion module that fuses intermediate steps of the denoising process, basing the next denoising step on a consensus of the current step. Here, we illustrate how MVD synchronizes and fuses view information from timesteps $T$ to 0.
  • Figure 2: A zoom-in diagram of the MVD module. Here, denoised views are first projected to partial textures in the UV texture domain and aggregated into a complete, clean latent texture. Then, we can sample the latent texture of the next time step based on this clean texture, and project to screen space to obtain consistent views.
  • Figure 3: An illustration of how the forwardly projected pixels are disjoint in UV space, and how filling and masking are applied to obtain partial textures with large patches of valid texels.
  • Figure 4: Comparison of object texturing results. Text prompts from top to bottom: "photo of Batman, sitting on a rock", "photo of a gray and black Nike airforce high top sneakers", "photo of link in the legend of zelda, photo-realistic, unreal 5", "A cute shiba inu dog" and "blue and white pottery style lucky cat with intricate patterns". Readers are encouraged to zoom in for better visualization and comparison. Results from LatentPaint are placed at the end of the paper due to space limitations.
  • Figure 5: Gallery of objects textured by our method. Corresponding text prompt is underneath each textured object.
  • ...and 5 more figures
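
The data flow described in the Figure 2 caption (project denoised views to partial textures, aggregate them into a clean latent texture, sample the next time step, project back to screen space) can be sketched as a toy example. Several things here are assumptions made for illustration and are not taken from the paper: each view is a list of scalar pixels with a hypothetical one-to-one `pixel_to_texel` index map, the per-view clean predictions are passed in rather than produced by a diffusion model, aggregation is a plain unweighted average, the filling and masking of Figure 3 is omitted, and the next step uses a deterministic DDIM-style update.

```python
import math


def unproject(view_pixels, pixel_to_texel, n_texels):
    """Scatter one view's pixels into a partial texture plus a validity mask."""
    partial = [0.0] * n_texels
    mask = [False] * n_texels
    for value, texel in zip(view_pixels, pixel_to_texel):
        partial[texel] = value
        mask[texel] = True
    return partial, mask


def render(texture, pixel_to_texel):
    """Gather texels back into screen space for one view."""
    return [texture[t] for t in pixel_to_texel]


def aggregate(partials, masks):
    """Average each texel over the views that cover it (0.0 if none do)."""
    out = []
    for i in range(len(partials[0])):
        vals = [p[i] for p, m in zip(partials, masks) if m[i]]
        out.append(sum(vals) / len(vals) if vals else 0.0)
    return out


def ddim_step(x_t, x0, alpha_t, alpha_prev):
    """Deterministic DDIM-style update; alphas are cumulative products,
    with 0 < alpha_t < 1 and alpha_prev <= 1."""
    eps = (x_t - math.sqrt(alpha_t) * x0) / math.sqrt(1.0 - alpha_t)
    return math.sqrt(alpha_prev) * x0 + math.sqrt(1.0 - alpha_prev) * eps


def synchronized_step(texture_t, x0_views, maps, alpha_t, alpha_prev):
    """One synchronized step: fuse clean predictions in texture space,
    step the shared noisy texture, then re-render every view from it."""
    n = len(texture_t)
    pairs = [unproject(v, m, n) for v, m in zip(x0_views, maps)]
    clean = aggregate([p for p, _ in pairs], [m for _, m in pairs])
    texture_prev = [
        ddim_step(xt, x0, alpha_t, alpha_prev)
        for xt, x0 in zip(texture_t, clean)
    ]
    return texture_prev, [render(texture_prev, m) for m in maps]


# Two 2-pixel views of a 3-texel texture; both views see texel 1.
maps = [[0, 1], [1, 2]]
x0_views = [[1.0, 1.0], [3.0, 3.0]]  # stand-ins for per-view clean predictions
texture_prev, views_prev = synchronized_step(
    [0.5, 0.5, 0.5], x0_views, maps, alpha_t=0.5, alpha_prev=1.0
)
```

Because every view is re-rendered from the same texture, the pixels that map to the shared texel are identical across views after the step, regardless of how much the per-view predictions disagreed beforehand.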