Advancing high-fidelity 3D and Texture Generation with 2.5D latents
Xin Yang, Jiantao Lin, Yingjie Xu, Haodong Li, Yingcong Chen
TL;DR
The paper tackles the gap between high-quality texture generation and accurate 3D geometry in 3D synthesis by introducing 2.5D latents that fuse multiview RGB, normal, and coordinate information into an image-like representation. It leverages pretrained 2D diffusion priors through a Mixture-of-LoRA framework and a 3D refiner-decoder to produce high-fidelity 3D Gaussians or meshes, while enabling geometry-conditioned texture generation from text or image prompts. Key contributions include the 2.5D latent representation, the Mixture-of-LoRA architecture, and a dedicated 3D refinement pipeline, supported by a curated dataset and comprehensive evaluations showing strong texture-geometry coherence and competitive quantitative performance. This approach offers a data- and compute-efficient path to unified 3D plus texture generation with practical implications for content creation and design pipelines.
Abstract
Despite the availability of large-scale 3D datasets and advancements in 3D generative models, the complexity and uneven quality of 3D geometry and texture data continue to hinder the performance of 3D generation techniques. In most existing approaches, 3D geometry and texture are generated in separate stages using different models and non-unified representations, which frequently leads to poor coherence between geometry and texture. To address these challenges, we propose a novel framework for joint generation of 3D geometry and texture. Specifically, we focus on generating versatile 2.5D representations that can be seamlessly transformed between 2D and 3D. Our approach begins by integrating multiview RGB, normal, and coordinate images into a unified representation, termed 2.5D latents. Next, we adapt pre-trained 2D foundation models for high-fidelity 2.5D generation, utilizing both text and image conditions. Finally, we introduce a lightweight 2.5D-to-3D refiner-decoder framework that efficiently generates detailed 3D representations from 2.5D images. Extensive experiments demonstrate that our model not only excels in generating high-quality 3D objects with coherent structure and color from text and image inputs but also significantly outperforms existing methods in geometry-conditioned texture generation.
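To make the idea of an image-like representation that moves between 2D and 3D more concrete, the following is a minimal sketch, not the paper's implementation. It assumes the simplest possible fusion: per-pixel channel concatenation of one view's RGB, normal, and coordinate maps, followed by lifting foreground pixels to colored, oriented 3D points. The function names `stack_views` and `lift_to_points`, the foreground `mask`, and the 9-channel layout are all hypothetical; the actual 2.5D latents are presumably produced by a learned encoder rather than raw stacking.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[float, ...]
Image = List[List[Pixel]]  # H x W grid of per-pixel channel tuples


def stack_views(rgb: Image, normal: Image, coord: Image) -> Image:
    """Concatenate RGB, normal, and coordinate maps channel-wise.

    Assumed layout (hypothetical): channels 0-2 RGB, 3-5 normal, 6-8 XYZ.
    """
    assert len(rgb) == len(normal) == len(coord), "maps must share height"
    stacked: Image = []
    for rgb_row, n_row, c_row in zip(rgb, normal, coord):
        assert len(rgb_row) == len(n_row) == len(c_row), "maps must share width"
        stacked.append([r + n + c for r, n, c in zip(rgb_row, n_row, c_row)])
    return stacked


def lift_to_points(stacked: Image, mask: List[List[bool]]) -> List[Dict[str, Pixel]]:
    """Turn each foreground pixel into a 3D point with a normal and a color."""
    points = []
    for row, mask_row in zip(stacked, mask):
        for px, is_foreground in zip(row, mask_row):
            if is_foreground:
                points.append({"xyz": px[6:9], "normal": px[3:6], "rgb": px[0:3]})
    return points


# Toy 1x2 view: one red foreground pixel, one empty background pixel.
rgb = [[(1.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
normal = [[(0.0, 0.0, 1.0), (0.0, 0.0, 0.0)]]
coord = [[(0.1, 0.2, 0.3), (0.0, 0.0, 0.0)]]
mask = [[True, False]]

stacked = stack_views(rgb, normal, coord)
points = lift_to_points(stacked, mask)
```

Because the coordinate map stores a 3D position at every pixel, the same array can be treated as an ordinary multi-channel image by a 2D model and read back as a point set without any camera unprojection. Under this toy layout, that is what lets one representation carry both texture and geometry.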
