TexSpot: 3D Texture Enhancement with Spatially-uniform Point Latent Representation

Ziteng Lu, Yushuang Wu, Chongjie Ye, Yuda Qiu, Jing Shao, Xiaoyang Guo, Jiaqing Zhou, Tianlei Hu, Kun Zhou, Xiaoguang Han

TL;DR

TexSpot addresses view inconsistency in 3D texture generation by introducing Texlet, a spatially uniform 3D texture latent representation that combines point-based geometric cues with compact, patch-based encoding. A two-stage TexSpot VAE learns Texlets by encoding local 2D texture patches and then aggregating global 3D context with a 3D encoder; a cascaded 3D-then-2D decoder reconstructs the texture patches. A Texlet-conditioned diffusion transformer, trained with a flow-matching objective and classifier-free guidance, refines textures produced by multi-view diffusion to improve fidelity and cross-view coherence. The authors report better texture detail, global coherence, and robustness than state-of-the-art methods, and see potential for controllable editing and joint geometry-texture modeling in future work.
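The summary names a flow-matching objective and classifier-free guidance without giving their form. The following is a minimal NumPy sketch of the textbook versions of both, assuming a straight-line noise-to-data path; it is not taken from the paper. `toy_velocity`, the latent shapes, and the all-zero "null" condition are hypothetical stand-ins for the Texlet-conditioned transformer and its inputs.

```python
import numpy as np

def flow_matching_loss(velocity_fn, x1, cond, rng):
    """Mean squared error between predicted and target velocity on a straight path."""
    x0 = rng.standard_normal(x1.shape)          # noise endpoint of the path
    t = rng.uniform(size=(x1.shape[0], 1))      # one time in [0, 1) per sample
    xt = (1.0 - t) * x0 + t * x1                # point on the line from x0 to x1
    target = x1 - x0                            # velocity is constant along that line
    pred = velocity_fn(xt, t, cond)
    return float(np.mean((pred - target) ** 2))

def guided_velocity(velocity_fn, xt, t, cond, null_cond, scale):
    """Classifier-free guidance: move from the unconditional prediction toward the conditional one."""
    v_uncond = velocity_fn(xt, t, null_cond)
    v_cond = velocity_fn(xt, t, cond)
    return v_uncond + scale * (v_cond - v_uncond)

def toy_velocity(xt, t, cond):
    """Hypothetical stand-in for the network: points from xt toward the condition."""
    return cond - xt

rng = np.random.default_rng(0)
x1 = rng.standard_normal((4, 8))      # hypothetical clean latents (assumed shape)
cond = rng.standard_normal((4, 8))    # hypothetical conditioning latents
null = np.zeros_like(cond)            # assumed "dropped condition" placeholder
t0 = np.zeros((4, 1))

loss = flow_matching_loss(toy_velocity, x1, cond, rng)
v = guided_velocity(toy_velocity, x1, t0, cond, null, scale=3.0)
```

With `scale=1.0` the guided velocity equals the conditional prediction, and with `scale=0.0` it equals the unconditional one; larger values extrapolate past the conditional prediction.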

Abstract

High-quality 3D texture generation remains a fundamental challenge due to the view inconsistency inherent in current mainstream multi-view diffusion pipelines. Existing representations rely either on UV maps, which suffer from distortion during unwrapping, or on point-based methods, which tightly couple texture fidelity to geometric density and thereby limit high-resolution texture generation. To address these limitations, we introduce TexSpot, a diffusion-based texture enhancement framework. At its core is Texlet, a novel 3D texture representation that merges the geometric expressiveness of point-based 3D textures with the compactness of UV-based representations. Each Texlet latent vector encodes a local texture patch via a 2D encoder and is further aggregated by a 3D encoder to incorporate global shape context. A cascaded 3D-to-2D decoder reconstructs high-quality texture patches, which allows the Texlet latent space to be learned. Leveraging this representation, we train a diffusion transformer conditioned on Texlets to refine and enhance textures produced by multi-view diffusion methods. Extensive experiments demonstrate that TexSpot significantly improves visual fidelity, geometric consistency, and robustness over existing state-of-the-art 3D texture generation and enhancement approaches. Project page: https://anonymous.4open.science/w/TexSpot-page-2D91.

Paper Structure

This paper contains 22 sections, 10 equations, 9 figures, and 2 tables.

Figures (9)

  • Figure 1: An illustration of three 3D texture representations.
  • Figure 2: Pipeline overview of TexSpot. It consists of (i) a texture patch partitioning step that divides the surface texture into small, spatially uniform patches; (ii) a TexSpot VAE with a two-stage local-global architecture that encodes all texture patches into a compact 3D latent space; and (iii) a conditional TexSpot DiT based on flow matching for texture enhancement.
  • Figure 3: Texture reconstruction results from our VAE, compared with the ground-truth (input) textures.
  • Figure 4: Texture enhancement results of TexSpot on scanned meshes of objects and scenes.
  • Figure 5: Texture enhancement results for generated textured 3D meshes. We use Meshy-6 (MeshyAI) and Tripo3D-3.0 (Tripo3d) to generate the coarse textures and apply our model for enhancement. Although the commercial models already produce decent texture quality, TexSpot consistently improves intricate details, delivering sharper outputs with fewer artifacts. Please zoom in for a more detailed comparison of the textures.
  • ...and 4 more figures
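
The Figure 2 caption describes a partitioning step that divides the surface into spatially uniform patches but does not say how. As an illustration only, the sketch below uses farthest point sampling, a common way to spread samples evenly over a point set, and then assigns every surface point to its nearest sampled center. This is an assumption, not the paper's stated method; the unit-sphere point cloud, the count of 64 centers, and both function names are hypothetical.

```python
import numpy as np

def farthest_point_sampling(points, k, start=0):
    """Return k indices into an (N, 3) array, chosen greedily to be far apart."""
    chosen = np.empty(k, dtype=np.int64)
    chosen[0] = start
    # Distance from every point to its nearest already-chosen point.
    dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, k):
        chosen[i] = int(np.argmax(dist))        # farthest point from the current set
        new_dist = np.linalg.norm(points - points[chosen[i]], axis=1)
        dist = np.minimum(dist, new_dist)
    return chosen

def assign_to_centers(points, centers):
    """Label each point with the index of its nearest center (a patch id)."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return np.argmin(d, axis=1)

rng = np.random.default_rng(0)
surface = rng.standard_normal((2000, 3))
surface /= np.linalg.norm(surface, axis=1, keepdims=True)   # toy surface: unit sphere

idx = farthest_point_sampling(surface, 64)      # 64 hypothetical patch centers
labels = assign_to_centers(surface, surface[idx])
```

Each entry of `labels` is a patch id in `[0, 64)`, so points sharing a label form one local patch around a center.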