Advancing high-fidelity 3D and Texture Generation with 2.5D latents

Xin Yang, Jiantao Lin, Yingjie Xu, Haodong Li, Yingcong Chen

TL;DR

The paper tackles the gap between high-quality texture generation and accurate 3D geometry in 3D synthesis by introducing 2.5D latents that fuse multiview RGB, normal, and coordinate information into an image-like representation. It leverages pretrained 2D diffusion priors through a Mixture-of-LoRA framework and a 3D refiner-decoder to produce high-fidelity 3D Gaussians or meshes, while enabling geometry-conditioned texture generation from text or image prompts. Key contributions include the 2.5D latent representation, the Mixture-of-LoRA architecture, and a dedicated 3D refinement pipeline, supported by a curated dataset and comprehensive evaluations showing strong texture-geometry coherence and competitive quantitative performance. This approach offers a data- and compute-efficient path to unified 3D plus texture generation with practical implications for content creation and design pipelines.

Abstract

Despite the availability of large-scale 3D datasets and advancements in 3D generative models, the complexity and uneven quality of 3D geometry and texture data continue to hinder the performance of 3D generation techniques. In most existing approaches, 3D geometry and texture are generated in separate stages using different models and non-unified representations, which frequently leads to poor coherence between geometry and texture. To address these challenges, we propose a novel framework for the joint generation of 3D geometry and texture. Specifically, we focus on generating a versatile 2.5D representation that can be seamlessly transformed between 2D and 3D. Our approach begins by integrating multiview RGB, normal, and coordinate images into a unified representation, termed 2.5D latents. Next, we adapt pre-trained 2D foundation models for high-fidelity 2.5D generation, using both text and image conditions. Finally, we introduce a lightweight 2.5D-to-3D refiner-decoder framework that efficiently generates detailed 3D representations from 2.5D images. Extensive experiments demonstrate that our model not only excels at generating high-quality 3D objects with coherent structure and color from text and image inputs but also significantly outperforms existing methods in geometry-conditioned texture generation.

Paper Structure

This paper contains 23 sections, 2 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: Examples of "X"-to-3D and "X"-to-texture generation. In this paper, we propose a new approach to bridging 3D generation with 2D diffusion priors: the generation of 2.5D latents. By making full use of both the 3D-like representation and the priors of pretrained 2D diffusion models, we achieve high-fidelity 3D generation and also excel at geometry-conditioned texture generation from image or text conditions. Please zoom in for details.
  • Figure 2: The multiview 2.5D representation. We curate a 2.5D dataset by rendering multiview RGB, normal, and coordinate maps from 3D assets. Using this dataset, we encode the images of each modality into 2D latents with a 2D VAE and project the latents into 3D voxel space via multiview averaging. We then train our 3D refiner and decoder to reconstruct the 3D assets (3D Gaussian splats or meshes) from the aggregated structured latents.
  • Figure 3: Our proposed framework. We propose a unified framework for (b) text- or image-to-3D generation and (c) geometry-conditioned texture generation. To provide the (a) hybrid image-text condition for 3D or texture generation, we adopt the off-the-shelf image encoder SigLIP [zhai2023sigmoidlosslanguageimage] and the Flux.1-Redux [blackforest2024flux] image embedder, together with the T5 and CLIP text encoders. During training, we randomly drop the image or text condition so that the model can still generate coherent content under any of the "X" conditions, which also improves model performance. Please refer to our experiment section for more details.
  • Figure 4: The architecture of the proposed modules. To repair the occluded areas in the structured latents aggregated from our 2.5D latents, we introduce the (a) 3D Residual UNet to refine the features and the occupancy field. We then apply the (b) sparse transformer decoder [xiang2024structured] for 3DGS or mesh reconstruction. (c) For each MLP block, we introduce three separate LoRA adapters tailored for 2.5D latent generation. Specifically, the general LoRA layer is shared by all modalities, while the normal and coordinate (coord.) features are additionally fed into auxiliary LoRA layers for extra feature projection; the outputs of the general LoRA and the normal or coord. LoRA are then merged by simple addition.
  • Figure 5: Qualitative comparison on image-to-3D generation. Please zoom in for details.
  • ...and 7 more figures
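
The Figure 2 caption mentions projecting per-view 2D latents into a 3D voxel space with multiview averaging. The NumPy sketch below illustrates one plausible reading of that step and is not the authors' implementation. The function name `aggregate_views`, the `grid_size` parameter, the foreground mask input, and the assumption that coordinate maps hold positions in [-0.5, 0.5] are all hypothetical choices made for illustration.

```python
import numpy as np

def aggregate_views(latents, coords, masks, grid_size=16):
    """Average per-pixel latent features into a voxel grid (illustrative only).

    latents: (V, H, W, C) per-view latent features
    coords:  (V, H, W, 3) per-pixel 3D positions, ASSUMED to lie in [-0.5, 0.5]
    masks:   (V, H, W) boolean foreground masks (hypothetical input)
    Returns (features, occupancy) with shapes (G, G, G, C) and (G, G, G).
    """
    channels = latents.shape[-1]
    feats = latents[masks]   # (N, C): features of all foreground pixels
    xyz = coords[masks]      # (N, 3): their 3D positions

    # Map continuous positions to integer voxel indices.
    idx = np.floor((xyz + 0.5) * grid_size).astype(np.int64)
    idx = np.clip(idx, 0, grid_size - 1)
    flat = np.ravel_multi_index(
        (idx[:, 0], idx[:, 1], idx[:, 2]), (grid_size,) * 3
    )

    # Accumulate feature sums and hit counts per voxel, then divide.
    sums = np.zeros((grid_size ** 3, channels), dtype=np.float64)
    counts = np.zeros(grid_size ** 3, dtype=np.int64)
    np.add.at(sums, flat, feats)
    np.add.at(counts, flat, 1)
    occupied = counts > 0
    sums[occupied] /= counts[occupied, None]

    shape = (grid_size,) * 3
    return sums.reshape(shape + (channels,)), occupied.reshape(shape)
```

Voxels that no view observes keep zero features and are marked unoccupied in this sketch; according to the Figure 4 caption, occluded regions of this kind are what the 3D Residual UNet refiner is meant to repair.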
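
The Figure 4 caption describes a Mixture-of-LoRA layer: a general LoRA shared by all modalities, plus auxiliary LoRAs for normal and coordinate features whose outputs are added to the general one. A minimal NumPy sketch of that routing follows. The class names, the rank, the zero initialization of the up-projection, and the modality labels `"rgb"`, `"normal"`, and `"coord"` are assumptions for illustration; the paper's actual layer shapes and scaling are not given in this summary.

```python
import numpy as np

class LoRA:
    """Low-rank adapter: x -> x @ down @ up (hypothetical minimal form)."""

    def __init__(self, d_in, d_out, rank, rng):
        self.down = rng.normal(0.0, 0.02, size=(d_in, rank))
        # Zero-initialized up-projection, so the adapter starts as a no-op.
        self.up = np.zeros((rank, d_out))

    def __call__(self, x):
        return x @ self.down @ self.up


class MixtureOfLoRALinear:
    """Frozen linear layer with one shared and two auxiliary LoRA adapters."""

    def __init__(self, weight, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        d_in, d_out = weight.shape
        self.weight = weight  # pretrained weight, kept frozen
        self.general = LoRA(d_in, d_out, rank, rng)  # shared by all modalities
        self.aux = {
            "normal": LoRA(d_in, d_out, rank, rng),
            "coord": LoRA(d_in, d_out, rank, rng),
        }

    def __call__(self, x, modality):
        if modality != "rgb" and modality not in self.aux:
            raise ValueError(f"unknown modality: {modality}")
        out = x @ self.weight + self.general(x)
        if modality in self.aux:
            # Merge general and auxiliary adapter outputs by simple addition.
            out = out + self.aux[modality](x)
        return out
```

In this sketch, RGB tokens pass through the base weight and the general adapter only, while normal and coordinate tokens receive one extra low-rank projection each, so the pretrained 2D weights themselves stay untouched.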