MD-ProjTex: Texturing 3D Shapes with Multi-Diffusion Projection

Ahmet Burak Yildirim, Mustafa Utku Aydogdu, Duygu Ceylan, Aysegul Dundar

TL;DR

MD-ProjTex addresses text-guided texture generation for arbitrary 3D shapes without model training or run-time optimization. It introduces a UV-space multi-diffusion framework that fuses per-view denoising directions across multiple viewpoints, so all views are generated in parallel with strong multi-view consistency. Key components include a denoising loop that passes through the encoder and decoder using Modified Denoising Steps, multi-scale texture generation, normal-guided weighting, camera-view selection via K-Means, and simple post-processing. Empirically, the method achieves better FID/KID scores and faster runtimes than state-of-the-art baselines, and user studies indicate a perceptual preference for its textures, making it practical for fast, high-quality 3D texture synthesis.

Abstract

We introduce MD-ProjTex, a method for fast and consistent text-guided texture generation for 3D shapes using pretrained text-to-image diffusion models. At the core of our approach is a multi-view consistency mechanism in UV space, which ensures coherent textures across different viewpoints. Specifically, MD-ProjTex fuses noise predictions from multiple views at each diffusion step and jointly updates the per-view denoising directions to maintain 3D consistency. In contrast to existing state-of-the-art methods that rely on optimization or sequential view synthesis, MD-ProjTex is computationally more efficient and achieves better quantitative and qualitative results.
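The fusion step described in the abstract can be illustrated with a minimal sketch. It assumes each view's prediction has already been projected onto a shared UV grid together with a per-texel weight that is zero where the view does not see the texel. The function name `fuse_views_in_uv`, the array layout, and the plain weighted average are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fuse_views_in_uv(view_textures, view_weights, eps=1e-8):
    """Weighted average of per-view textures on a shared UV grid (hypothetical helper).

    view_textures: array of shape (V, H, W, C), one projected texture per view.
    view_weights:  array of shape (V, H, W), zero where a view does not cover a texel.
    Returns the fused texture (H, W, C) and a boolean coverage mask (H, W).
    """
    tex = np.asarray(view_textures, dtype=np.float64)
    w = np.asarray(view_weights, dtype=np.float64)

    total = w.sum(axis=0)                       # (H, W) summed weight per texel
    weighted = (tex * w[..., None]).sum(axis=0)  # (H, W, C) weighted sum over views
    fused = weighted / np.maximum(total, eps)[..., None]

    covered = total > 0
    fused[~covered] = 0.0                        # texels seen by no view stay empty
    return fused, covered
```

Because every view reads its next input back from the same fused texture, overlapping regions receive one shared value, which is what would keep the views consistent in a scheme of this kind.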

Paper Structure

This paper contains 12 sections, 8 equations, 12 figures, 2 tables.

Figures (12)

  • Figure 1: Our framework can texture 3D models based on given prompts. It is training-free and fast due to the parallel generation of multiple views with the multi-diffusion approach applied to the projected textures.
  • Figure 2: We use a latent-based diffusion model conditioned on depth and lineart for the denoising steps, while the multi-diffusion step takes place on the projected textures. To connect the two, we incorporate encoder (E), denoising (SD), decoder (D), and projection steps. For simplicity, only three views are shown in the multi-view input visualization. Note that $z_t$, $z_{t-1}$, and $z_0$ are latent-space features of size $4\times64\times64$ and are not directly visualizable, so the figure shows downsampled images in their place. In contrast, $x_0$ lives in image space and is directly visualizable. We initialize $z_T$ from a normal distribution; each subsequent step then runs denoising, decoding, projection, and encoding.
  • Figure 3: Ablation results on Stable Diffusion (SD) without the multi-view consistency component. These experiments show how a ControlNet-based Stable Diffusion model generates images; because the multi-view consistency component is disabled, consistency across views is not expected. (a) Outputs of the original SD implementation, which produces diverse results with a realistic color palette. (b) In the conventional latent diffusion setup, denoising steps are applied sequentially in latent space, and the latent features are decoded into an image only once denoising is complete. In our setting, however, each denoising step must decode the image so that the multi-view consistency component can be applied to the projected textures, and then re-encode it to resume denoising. This experiment adds only the encoder-decoder pipeline, without the multi-view consistency component: in each denoising step, we decode the prediction to RGB image space and then encode it back into latent space. This causes color saturation, with purple and pinkish hues. The round trip through image space cannot simply be dropped, because the multi-view consistency component averages in the UV map. (c) We therefore modify the denoising steps as described in the paper's section on the modified denoising steps. Even with the encoder and decoder in the pipeline, the generated images resemble those of the original implementation.
  • Figure 4: (a) Visualization of normal maps generated from renderings taken from different camera views. (b) Based on the normal maps, we assign reliability values to various views for a specific pixel on the UV map, with direct views receiving higher reliability scores. (c) Visualization of the texturing results from different camera positions.
  • Figure 5: Qualitative results of our and competing methods with multi-view renderings.
  • ...and 7 more figures
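
The reliability idea in the Figure 4 caption, where views that face a surface point directly receive higher scores, can be sketched as a cosine weighting between the surface normal and the direction toward each camera. The function name `view_reliability`, the clipped-cosine form, and the `power` exponent are assumptions made for illustration; the caption does not specify the exact weighting used.

```python
import numpy as np

def view_reliability(normals, view_dirs, power=1.0):
    """Per-view reliability weights from surface normals (hypothetical helper).

    normals:   (N, 3) unit surface normals, one per UV texel.
    view_dirs: (V, 3) unit vectors pointing from the surface toward each camera.
    power:     assumed sharpening exponent; larger values favor head-on views.
    Returns (V, N) weights that sum to 1 over views wherever any view faces the texel.
    """
    n = np.asarray(normals, dtype=np.float64)
    d = np.asarray(view_dirs, dtype=np.float64)

    cos = np.clip(d @ n.T, 0.0, None)   # (V, N); back-facing views get zero
    raw = cos ** power
    total = raw.sum(axis=0, keepdims=True)

    # Normalize across views; texels that no view faces keep all-zero weights.
    return np.divide(raw, total, out=np.zeros_like(raw), where=total > 0)
```

Weights of this form could serve as the per-texel weights when averaging projected views in UV space, so that grazing-angle observations contribute less than direct ones.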