3DTopia-XL: Scaling High-quality 3D Asset Generation via Primitive Diffusion

Zhaoxi Chen, Jiaxiang Tang, Yuhao Dong, Ziang Cao, Fangzhou Hong, Yushi Lan, Tengfei Wang, Haozhe Xie, Tong Wu, Shunsuke Saito, Liang Pan, Dahua Lin, Ziwei Liu

TL;DR

This work introduces PrimX, a primitive-based 3D representation encoded as an efficient N×D tensor that jointly models the shape, texture, and material of a textured mesh. Built on PrimX, the authors develop Latent Primitive Diffusion, a Transformer-based diffusion model operating on latent per-primitive tokens, augmented by a 3D VAE for local patch compression to enable scalable, high-resolution 3D generation. The approach supports text-to-3D and image-to-3D generation with high-fidelity geometry and physically based rendering (PBR) materials, and includes robust mesh-to-PrimX fitting and PrimX-to-mesh extraction pipelines for practical GLB outputs. Empirical results show that PrimX outperforms baselines in geometry quality, renderable textures, and material realism, scales effectively with model size and primitive count, and supports inpainting and interpolation.

Abstract

The increasing demand for high-quality 3D assets across various industries necessitates efficient and automated 3D content creation. Despite recent advancements in 3D generative models, existing methods still face challenges with optimization speed, geometric fidelity, and the lack of assets for physically based rendering (PBR). In this paper, we introduce 3DTopia-XL, a scalable native 3D generative model designed to overcome these limitations. 3DTopia-XL leverages a novel primitive-based 3D representation, PrimX, which encodes detailed shape, albedo, and material field into a compact tensorial format, facilitating the modeling of high-resolution geometry with PBR assets. On top of the novel representation, we propose a generative framework based on Diffusion Transformer (DiT), which comprises 1) Primitive Patch Compression and 2) Latent Primitive Diffusion. 3DTopia-XL learns to generate high-quality 3D assets from textual or visual inputs. We conduct extensive qualitative and quantitative experiments to demonstrate that 3DTopia-XL significantly outperforms existing methods in generating high-quality 3D assets with fine-grained textures and materials, efficiently bridging the quality gap between generative models and real-world applications.

Paper Structure (52 sections, 12 equations, 19 figures, 8 tables)

Figures (19)

  • Figure 1: 3DTopia-XL generates high-quality 3D assets with smooth geometry and spatially varied textures and materials. The output asset (GLB mesh) can be seamlessly ported into graphics engines for physically-based rendering.
  • Figure 2: Illustration of PrimX. We propose to represent the 3D shape, texture, and material of a textured mesh as a compact $N\times D$ tensor (Sec. \ref{sec:primx-def}). We anchor $N$ primitives to the positions sampled on the mesh surface. Each primitive ${\mathcal{V}}_k$ is a tiny voxel with a resolution of $a^3$, parameterized by its 3D position ${\mathbf{t}}_k \in {\mathbb{R}}^3$, a global scale factor $s_k \in {\mathbb{R}}^{+}$, and corresponding spatially varied payload ${\bm{X}}_k \in {\mathbb{R}}^{a\times a\times a\times 6}$ for SDF, RGB, and material. This tensorial representation can be rapidly computed from a textured mesh within 1.5 minutes (Sec. \ref{sec:fitting}).
  • Figure 3: Overview of 3DTopia-XL. As a native 3D diffusion model, 3DTopia-XL is built upon a novel 3D representation PrimX (Sec. \ref{sec:primx}). This compact and expressive representation encodes the shape, texture, and material of a textured mesh efficiently, which allows modeling high-resolution geometry with PBR assets. Furthermore, this tensorial representation facilitates our patch-based compression using primitive patch VAE (Sec. \ref{sec:vae}). We then use our novel latent primitive diffusion (Sec. \ref{sec:lpd}) for 3D generative modeling, which operates the diffusion and denoising process on the set of latent PrimX, naturally compatible with Transformer-based neural architectures.
  • Figure 4: Evaluations of different 3D representations. We evaluate the effectiveness of different representations in fitting the ground truth's shape, texture, and material (right). All representations are constrained to a budget of 1.05M parameters. PrimX achieves the highest fidelity in geometry and appearance while also offering a significant advantage in runtime efficiency (Table \ref{tab:repr-eval}).
  • Figure 5: Image-to-3D comparisons. For each method, we import the textured mesh predicted from the input image into Blender and render it with the target environment map. We compare our single-view conditioned model with sparse-view reconstruction models and image-conditioned diffusion models. 3DTopia-XL achieves the best visual and geometry quality. Because it generates spatially varied PBR assets (shown at the far right), our generated mesh can also produce vivid reflectance with specular highlights and glossiness.
  • ...and 14 more figures
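
To make the $N\times D$ layout described in the Figure 2 caption concrete, the following is a minimal sketch in plain Python. It assumes each row is the concatenation of position, scale, and flattened payload, so that $D = 3 + 1 + 6a^3$. The field order, the helper names (`primitive_dim`, `pack_primitive`, `unpack_primitive`, `random_primx`), and the toy sizes are illustrative assumptions rather than the authors' implementation, and the values are random placeholders rather than a fitted asset.

```python
import random
from typing import List, Sequence, Tuple

# Figure 2: the payload holds SDF, RGB, and material in 6 channels per voxel cell.
PAYLOAD_CHANNELS = 6


def primitive_dim(a: int) -> int:
    """Row length D: 3 (position t_k) + 1 (scale s_k) + a^3 * 6 (payload X_k)."""
    return 3 + 1 + PAYLOAD_CHANNELS * a ** 3


def pack_primitive(t: Sequence[float], s: float,
                   payload: Sequence[float], a: int) -> List[float]:
    """Flatten one primitive into a length-D row.

    The order (t, s, payload) is an assumption made for this sketch.
    """
    if len(t) != 3:
        raise ValueError("position must have 3 components")
    if s <= 0:
        raise ValueError("scale must be positive")
    if len(payload) != PAYLOAD_CHANNELS * a ** 3:
        raise ValueError("payload must have a^3 * 6 values")
    return [*t, s, *payload]


def unpack_primitive(row: Sequence[float],
                     a: int) -> Tuple[List[float], float, List[float]]:
    """Split a length-D row back into (position, scale, flattened payload)."""
    if len(row) != primitive_dim(a):
        raise ValueError("row length does not match D for this a")
    return list(row[:3]), row[3], list(row[4:])


def random_primx(n: int, a: int, seed: int = 0) -> List[List[float]]:
    """Build an N x D list of rows filled with placeholder random values."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        t = [rng.uniform(-1.0, 1.0) for _ in range(3)]
        s = rng.uniform(0.01, 0.1)
        payload = [rng.uniform(-1.0, 1.0)
                   for _ in range(PAYLOAD_CHANNELS * a ** 3)]
        rows.append(pack_primitive(t, s, payload, a))
    return rows


if __name__ == "__main__":
    n, a = 4, 2  # toy sizes chosen for readability, not taken from the paper
    primx = random_primx(n, a)
    print(len(primx), len(primx[0]))  # prints "4 52"
```

Because every primitive flattens to a row of the same length, a set of $N$ primitives can be treated as a sequence of $N$ tokens, which matches the caption of Figure 3 describing the representation as naturally compatible with Transformer-based architectures.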