AssetFormer: Modular 3D Assets Generation with Autoregressive Transformer

Lingting Zhu, Shengju Qian, Haidi Fan, Jiayu Dong, Zhenchao Jin, Siwei Zhou, Gen Dong, Xin Wang, Lequan Yu

TL;DR

AssetFormer presents a decoder-only Transformer framework for generating modular 3D assets composed of discrete primitives from textual prompts. By employing discrete tokenization, DFS/BFS token ordering, classifier-free guidance, and a SlowFast decoding scheme, it achieves higher quality and efficiency than traditional PCG or mesh-based 3D generation on a real-world UGC dataset (16k real + 4k synthetic, 25 primitive types). Key findings include the importance of token order and data-source diversity, the practicality of a modular representation for production pipelines, and notable speedups in autoregressive decoding without sacrificing fidelity. The work offers a practical pathway for text-conditioned modular asset generation with broad implications for UGC platforms and game development.
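Classifier-free guidance, one of the decoding components named above, is usually applied to an autoregressive model by blending the next-token logits from a text-conditioned pass with those from an unconditional pass. The sketch below shows that standard blending rule on toy numbers. The function names, the logit values, and the guidance scale are illustrative assumptions, not taken from the AssetFormer code.

```python
import math

def cfg_logits(cond, uncond, scale):
    """Standard classifier-free guidance blend: uncond + scale * (cond - uncond).

    scale = 1 recovers the conditional logits; scale = 0 recovers the
    unconditional ones. The value used by AssetFormer is not stated here.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy next-token logits over a 3-token vocabulary (hypothetical values).
cond = [2.0, 0.5, -1.0]    # pass with the text prompt
uncond = [1.0, 1.0, 0.0]   # pass with the prompt dropped
guided = cfg_logits(cond, uncond, scale=2.0)
probs = softmax(guided)    # distribution to sample the next token from
```

A scale above 1 pushes the distribution further toward tokens that the prompt makes more likely, at the cost of some diversity.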

Abstract

The digital industry demands high-quality, diverse modular 3D assets, especially for user-generated content (UGC). In this work, we introduce AssetFormer, an autoregressive Transformer-based model designed to generate modular 3D assets from textual descriptions. Our pilot study leverages real-world modular assets collected from online platforms. AssetFormer tackles the challenge of creating assets composed of primitives that adhere to constrained design parameters for various applications. By innovatively adapting module sequencing and decoding techniques inspired by language models, our approach enhances asset generation quality through autoregressive modeling. Initial results indicate the effectiveness of AssetFormer in streamlining asset creation for professional development and UGC scenarios. This work presents a flexible framework extendable to various types of modular 3D assets, contributing to the broader field of 3D content generation. The code is available at https://github.com/Advocate99/AssetFormer.
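To make the idea of module sequencing concrete, the sketch below linearizes a small set of primitives in breadth-first and depth-first order before they would be turned into tokens. It assumes the primitives are linked by a hypothetical adjacency graph and that ties are broken by primitive index. How AssetFormer actually builds and traverses this structure is not specified in this summary.

```python
from collections import deque

def bfs_order(adj, root):
    """Visit primitives level by level, starting from root."""
    seen, order, queue = {root}, [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in sorted(adj[node]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return order

def dfs_order(adj, root):
    """Follow one chain of connected primitives to its end before backtracking."""
    seen, order, stack = set(), [], [root]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Push in reverse so the smallest-index neighbor is visited first.
        for nxt in sorted(adj[node], reverse=True):
            if nxt not in seen:
                stack.append(nxt)
    return order

# Hypothetical asset with four primitives: 0 touches 1 and 2, 1 touches 3.
adj = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
bfs = bfs_order(adj, 0)
dfs = dfs_order(adj, 0)
```

Both traversals produce a deterministic sequence for the same asset, which is the property an autoregressive model needs from its token order; they differ in whether neighboring primitives or whole connected chains end up adjacent in the sequence.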

Paper Structure (24 sections, 4 equations, 14 figures, 7 tables, 2 algorithms)

This paper contains 24 sections, 4 equations, 14 figures, 7 tables, 2 algorithms.

Figures (14)

  • Figure 1: Illustration of modular 3D assets. Modular assets can be decomposed into primitives, each possessing its own attributes, e.g., the orientation $r$ and the position $\bm x$. The modular asset can be rendered with configurations to enable 3D deployment.
  • Figure 2: Overview of the AssetFormer Framework. Given the modular assets, e.g., the building, we first render the assets in digital engines and produce the images for querying GPT-4o. The cleaned captions, pre-filled with a re-ordered token set, serve as input for the autoregressive modeling. After training, AssetFormer autoregressively produces modular assets that are ready to be integrated into industrial environments, with model-based enhancement and application-driven deployment.
  • Figure 3: Qualitative comparison with baseline methods. (a) While PCG can synthesize high-quality building models, it requires meticulous algorithm design for complex buildings and can only produce simple assets that are difficult to control with text. (b) 3D generation methods typically yield dense meshes, struggle to accurately capture intricate geometries (e.g., the internal structure of buildings), and produce imperfect textures. In contrast, our method follows preferred design rules (e.g., standard primitives with plain faces) and delivers precise textures in real-world pipelines through primitive-texture mapping.
  • Figure 4: Qualitative ablation analysis. (a) Ablation on token orders. With an improper token order, the model struggles to fit the data distribution and generate from it accurately. (b) Ablation on data sources. When trained on a single data source, the models fail to cover a wide range of building types and exhibit a higher ratio of failure cases. Artifacts are indicated by red rectangles.
  • Figure 5: Qualitative analysis of fine-tuning native 3D generative models. (a) After watertight conversion, the modular information is lost and the geometry is erroneous (e.g., the ladder). (b) The geometric details are altered (zoom in to see the vertices and faces). (c) The fine-tuned Hunyuan3D 2.1 produces overall inferior assets, and (d) the details are poor.
  • ...and 9 more figures