RomanTex: Decoupling 3D-aware Rotary Positional Embedded Multi-Attention Network for Texture Synthesis

Yifei Feng, Mingxin Yang, Shuhui Yang, Sheng Zhang, Jiaao Yu, Zibo Zhao, Yuhong Liu, Jie Jiang, Chunchao Guo

TL;DR

RomanTex targets high-quality, seam-free texture synthesis for 3D assets by combining 2D diffusion priors with explicit 3D geometry. It introduces a 3D-aware Rotary Positional Embedding to enforce cross-view consistency, a decoupled multi-attention architecture to separate fidelity, diversity, and coherence, and a geometry-related CFG scheme to balance image and geometry guidance during inference. Quantitative metrics and user studies indicate state-of-the-art texture quality and consistency relative to the compared baselines, along with robust back-view texture generation. By tightly integrating geometry and diffusion priors, the work reduces seams and misalignments in textured 3D assets and offers a scalable framework for geometry-guided texture synthesis.

Abstract

Painting textures for existing geometries is a critical yet labor-intensive process in 3D asset generation. Recent advancements in text-to-image (T2I) models have led to significant progress in texture generation. Most existing research approaches this task by first generating images in 2D space using image diffusion models, followed by a texture baking process to obtain a UV texture. However, these methods often struggle to produce high-quality textures due to inconsistencies among the generated multi-view images, resulting in seams and ghosting artifacts. In contrast, 3D-based texture synthesis methods aim to address these inconsistencies, but they often neglect 2D diffusion model priors, making them challenging to apply to real-world objects. To overcome these limitations, we propose RomanTex, a multiview-based texture generation framework that integrates a multi-attention network with an underlying 3D representation, facilitated by our novel 3D-aware Rotary Positional Embedding. Additionally, we incorporate a decoupling characteristic in the multi-attention block to enhance the model's robustness in the image-to-texture task, enabling semantically correct back-view synthesis. Furthermore, we introduce a geometry-related Classifier-Free Guidance (CFG) mechanism to further improve the alignment with both geometries and images. Quantitative and qualitative evaluations, along with comprehensive user studies, demonstrate that our method achieves state-of-the-art results in texture quality and consistency.
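
The abstract does not give the formula behind the geometry-related CFG. Purely as an illustration, the sketch below shows one common way to run classifier-free guidance with two conditions, using a separate scale for each. The function name, the argument names, and the particular combination of terms are assumptions made for this example and should not be read as the authors' scheme.

```python
def two_condition_cfg(eps_uncond, eps_geo, eps_full, s_geo, s_img):
    """Combine three noise predictions using separate guidance scales.

    eps_uncond: prediction with no conditions (hypothetical input)
    eps_geo:    prediction with the geometry condition only
    eps_full:   prediction with both geometry and image conditions
    s_geo:      guidance scale for the geometry condition
    s_img:      guidance scale for the image condition
    """
    return [
        u + s_geo * (g - u) + s_img * (f - g)
        for u, g, f in zip(eps_uncond, eps_geo, eps_full)
    ]


# Toy noise predictions for a 3-element latent (made-up numbers).
eps_uncond = [0.0, 0.1, -0.2]
eps_geo = [0.2, 0.0, -0.1]
eps_full = [0.5, -0.3, 0.4]

guided = two_condition_cfg(eps_uncond, eps_geo, eps_full, s_geo=2.0, s_img=3.0)
```

In this form, setting both scales to 1 recovers the fully conditioned prediction, and setting both to 0 recovers the unconditional one, so each scale can be tuned independently to trade off adherence to geometry against adherence to the reference image.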

Paper Structure

This paper contains 19 sections, 12 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: 3D assets with high-quality textures generated by our method.
  • Figure 2: Overview of the proposed texture synthesis framework. Projected geometry conditions and image conditions are incorporated via noise concatenation and reference attention injection, respectively. To enhance multi-view consistency, a multi-view attention block with 3D-aware RoPE is integrated, using queries based on canonical coordinate maps.
  • Figure 3: Visual comparison with text-to-texture methods. We present two perspectives side by side to compare consistency; the same scheme is used for the visual comparison with image-to-texture methods.
  • Figure 4: Visual comparison with image-to-texture methods. We zoom in on local regions to enable fine-grained evaluation of texture detail.
  • Figure 5: Ablation study on core components. We validate the effectiveness of our approach by sequentially disabling individual modules: 3D-aware RoPE, Decoupled Reference Branch, and Geometry-related CFG, showcasing their distinct contributions.
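
The Figure 2 caption mentions a multi-view attention block with 3D-aware RoPE driven by canonical coordinate maps, but the formulation itself is not reproduced here. As a minimal sketch of the general idea, the code below applies standard rotary embeddings per axis, using a token's 3D coordinate instead of a 1D sequence index. The three-way split of the feature vector, the frequency base, and all function names are assumptions for illustration, not the paper's definition.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive feature pairs of `vec` by angles proportional to `pos`."""
    d = len(vec)
    assert d % 2 == 0, "feature chunk must have even length"
    out = []
    for i in range(0, d, 2):
        angle = pos * base ** (-i / d)  # one frequency per feature pair
        c, s = math.cos(angle), math.sin(angle)
        out += [vec[i] * c - vec[i + 1] * s, vec[i] * s + vec[i + 1] * c]
    return out


def rope_3d(vec, xyz, base=10000.0):
    """Split `vec` into three chunks and rotate each by one coordinate of `xyz`."""
    d = len(vec)
    assert d % 6 == 0, "feature length must split into three even chunks"
    chunk = d // 3
    out = []
    for axis in range(3):
        out += rope_1d(vec[axis * chunk:(axis + 1) * chunk], xyz[axis], base)
    return out


def score(q, k):
    """Unnormalized attention logit between a query and a key."""
    return sum(a * b for a, b in zip(q, k))


# Hypothetical query/key features for two pixels seen in different views
# that map to the same 3D surface point.
q = [0.5, -1.0, 0.3, 0.8, -0.2, 0.1]
k = [1.0, 0.4, -0.7, 0.2, 0.9, -0.5]
point = (0.2, -0.4, 0.6)

same_point_logit = score(rope_3d(q, point), rope_3d(k, point))
```

Because each rotation depends only on the 3D coordinate, the logit between two tokens depends on their relative offset in 3D rather than on which view or pixel they came from. Tokens that land on the same surface point therefore interact as if no positional offset separated them, which is the property that makes this kind of embedding attractive for cross-view consistency.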