TEXTRIX: Latent Attribute Grid for Native Texture Generation and Beyond

Yifei Zeng, Yajie Bao, Jiachen Qian, Shuang Wu, Youtian Lin, Hao Zhu, Buyu Li, Feihu Zhang, Xun Cao, Yao Yao

TL;DR

TEXTRIX introduces a native 3D texture generation framework that operates directly in a sparse latent 3D attribute grid, bypassing multi-view fusion and UV-space seams. It combines a Sparse VAE with a Diffusion Transformer equipped with sparse latent conditioning to generate high-fidelity textures and to perform precise 3D segmentation, all within a unified native representation. The approach demonstrates state-of-the-art performance on texture generation and 3D part segmentation across complex meshes and can be extended to PBR materials. Ablation studies confirm the critical role of the sparse conditioning and the rendering-based training objectives in achieving high fidelity and coherent cross-view results.

Abstract

Prevailing 3D texture generation methods, which often rely on multi-view fusion, are frequently hindered by inter-view inconsistencies and incomplete coverage of complex surfaces, limiting the fidelity and completeness of the generated content. To overcome these challenges, we introduce TEXTRIX, a native 3D attribute generation framework for high-fidelity texture synthesis and downstream applications such as precise 3D part segmentation. Our approach constructs a latent 3D attribute grid and leverages a Diffusion Transformer equipped with sparse attention, enabling direct coloring of 3D models in volumetric space and fundamentally avoiding the limitations of multi-view fusion. Built upon this native representation, the framework naturally extends to high-precision 3D segmentation by training the same architecture to predict semantic attributes on the grid. Extensive experiments demonstrate state-of-the-art performance on both tasks, producing seamless, high-fidelity textures and accurate 3D part segmentation with precise boundaries.
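
To make the idea of a sparse 3D attribute grid concrete, here is a minimal sketch, not the authors' implementation. It assumes surface points are already normalized to the unit cube and that each point carries an RGB color; the function name `build_sparse_attribute_grid`, the dictionary layout, and the per-voxel averaging are all illustrative assumptions. The paper's grid is latent and learned through a VAE, which this toy example does not attempt to reproduce.

```python
import math
from collections import defaultdict


def build_sparse_attribute_grid(points, colors, resolution=64):
    """Toy sparse attribute grid (hypothetical helper, not from the paper).

    points: iterable of (x, y, z) assumed to lie in the unit cube [0, 1).
    colors: iterable of (r, g, b), one per point.
    Returns a dict mapping occupied voxel index (i, j, k) to the mean color
    of the points that fall inside that voxel. Empty voxels are not stored,
    which is what makes the grid sparse.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0.0])
    counts = defaultdict(int)
    for point, rgb in zip(points, colors):
        # Quantize each coordinate to a voxel index, clamped to the grid.
        voxel = tuple(
            min(resolution - 1, max(0, math.floor(c * resolution)))
            for c in point
        )
        for channel in range(3):
            sums[voxel][channel] += rgb[channel]
        counts[voxel] += 1
    # Average the accumulated colors per occupied voxel.
    return {
        voxel: [s / counts[voxel] for s in total]
        for voxel, total in sums.items()
    }


if __name__ == "__main__":
    pts = [(0.10, 0.10, 0.10), (0.12, 0.10, 0.10), (0.90, 0.90, 0.90)]
    cols = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0, 0.0)]
    grid = build_sparse_attribute_grid(pts, cols, resolution=4)
    for voxel, color in sorted(grid.items()):
        print(voxel, color)
```

Only occupied voxels are stored, so memory scales with the surface rather than with the full volume. Other per-voxel attributes mentioned in the paper, such as semantic labels or PBR material parameters, could in principle be stored alongside color in the same structure.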

Paper Structure

This paper contains 26 sections, 5 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: Limitations of prevailing paradigms. Left: TRELLIS exhibits inconsistency and low fidelity on the lips. Middle: Multi-view fusion of Hunyuan3D 3.0 results in noticeable seams and blurring around the mouth and neck. Right: Our native 3D framework (TEXTRIX) produces a seamless, high-fidelity result.
  • Figure 2: The VAE of our method. We introduce a native latent 3D attribute representation, which contains properties such as color, semantic labels, and PBR materials of the object within each sparse voxel. An end-to-end attribute VAE is employed to encode the representation into a continuous and compact latent space.
  • Figure 3: Architecture of our Diffusion Transformer (DiT). This image-conditioned model operates on sparse latents to perform unified generation (texturing) and perception (segmentation). We introduce a novel sparse latent conditioning strategy that ensures high-fidelity alignment with the input image.
  • Figure 4: Qualitative comparison of single-view texture generation. Our method demonstrates superior fidelity, consistency with the input view, and freedom from artifacts compared to existing SOTA methods.
  • Figure 5: Qualitative comparison of multi-view conditioned generation against commercial methods. Our method produces seamless results with fewer artifacts in occluded regions.
  • ...and 4 more figures