VCD-Texture: Variance Alignment based 3D-2D Co-Denoising for Text-Guided Texturing

Shang Liu, Chaohui Yu, Chenjie Cao, Wen Qian, Fan Wang

TL;DR

VCD-Texture tackles the gap between 2D diffusion priors and 3D texture synthesis with a 3D-2D collaborative denoising framework. Joint Noise Prediction (JNP) unifies 2D and 3D latent learning, multi-view aggregation and rasterization (MV-AR) fuses the per-view predictions, Variance Alignment (VA) corrects the variance shift introduced by rasterization, and an inpainting refinement enhances details. The method is evaluated on a benchmark assembled from Objaverse, ShapeNetSem, and ShapeNet using FID, ClipFID, ClipScore, and ClipVar, and shows better fidelity and cross-view consistency than prior texturing approaches. Because it uses pre-trained diffusion models without additional training and adds a principled variance correction, the approach delivers high-quality, view-coherent textures with notable efficiency improvements.

Abstract

Recent research on texture synthesis for 3D shapes has benefited greatly from rapid advances in 2D text-to-image diffusion models, through both inpainting-based and optimization-based approaches. However, these methods ignore the modal gap between the 2D diffusion model and 3D objects: they primarily render 3D objects into 2D images and texture each image separately. In this paper, we revisit texture synthesis and propose a Variance alignment based 3D-2D Collaborative Denoising framework, dubbed VCD-Texture, to address these issues. Specifically, we first unify 2D and 3D latent feature learning in the diffusion self-attention modules with re-projected 3D attention receptive fields. Subsequently, the denoised multi-view 2D latent features are aggregated into 3D space and then rasterized back to form more consistent 2D predictions. However, the rasterization process introduces an intractable variance bias, which the proposed variance alignment addresses theoretically, enabling high-fidelity texture synthesis. Moreover, we present an inpainting refinement to further improve details in conflicting regions. Notably, there is no publicly available benchmark for evaluating texture synthesis, which hinders its development. We therefore construct a new evaluation set from three open-source 3D datasets and propose four metrics to thoroughly validate texturing performance. Comprehensive experiments demonstrate that VCD-Texture outperforms competing methods.
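
The variance bias mentioned in the abstract can be illustrated with a toy experiment. The sketch below is not the paper's formulation: it assumes that rasterization blends independent unit-variance noise values with interpolation weights that sum to 1, and the weights and function names are hypothetical. Under that assumption the blended value has variance equal to the sum of squared weights, which is below 1, and dividing the deviation from the mean by the square root of that sum restores unit variance.

```python
import math
import random
import statistics


def blend(samples, weights):
    """Weighted average of samples (stand-in for rasterization interpolation)."""
    return sum(w * x for w, x in zip(weights, samples))


def variance_aligned_blend(samples, weights, mean=0.0):
    """Blend, then rescale the deviation from the mean so that
    independent unit-variance inputs yield a unit-variance output."""
    scale = math.sqrt(sum(w * w for w in weights))
    return mean + (blend(samples, weights) - mean) / scale


rng = random.Random(0)
weights = [0.5, 0.3, 0.2]  # hypothetical interpolation weights, sum to 1

plain, aligned = [], []
for _ in range(20000):
    # Independent standard-normal noise at three hypothetical 3D points.
    noise = [rng.gauss(0.0, 1.0) for _ in weights]
    plain.append(blend(noise, weights))
    aligned.append(variance_aligned_blend(noise, weights))

std_plain = statistics.pstdev(plain)
std_aligned = statistics.pstdev(aligned)

# Theory: plain std is sqrt(0.38), roughly 0.616; aligned std is 1.
print(f"plain blend std:   {std_plain:.3f}")
print(f"aligned blend std: {std_aligned:.3f}")
```

A diffusion sampler expects noise with a fixed variance at each step, so a shrunken standard deviation like the one in the plain blend is the kind of mismatch a variance correction is meant to remove.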

Paper Structure

This paper contains 29 sections, 20 equations, 8 figures, 5 tables, 2 algorithms.

Figures (8)

  • Figure 1: Text-guided textures for 3D shapes generated by VCD-Texture. Our method achieves high-quality texture synthesis from simple captions.
  • Figure 2: The framework of VCD-Texture: (a) shows the overall process, including 3D-2D collaborative denoising and inpainting refinement; (b) shows the JNP in the SD U-Net; (c) shows the MV-AR with VA. Note that we apply only the aggregation sub-process of MV-AR to the denoised multi-view images $\hat{\mathbf{I}}$ to obtain the texture $\hat{\mathbf{I}}^{3D}$.
  • Figure 3: Illustration of inpainting refinement. (a) shows the view rendered from the initially inconsistent texture $\hat{\mathbf{I}}^{3D}$; (b) shows the dilated inpainting mask rendered from the 3D mask $\mathbf{M}$; (c) is the depth map rendered from the input mesh; (d) shows the final texture updated through our inpainting refinement.
  • Figure 4: Qualitative comparisons of text-guided texture synthesis. Prompts from top to bottom are: "old and rusty volkswagon beetle", "crocodile skin handbag", "barrel", "half moon chaise", "sausage", "lego", "electric oven".
  • Figure 5: The effectiveness of VA. (a) shows the standard deviation curves of three denoising policies; (b) shows a qualitative comparison with and without (w/o) VA.
  • ...and 3 more figures