Large-Vocabulary 3D Diffusion Model with Transformer

Ziang Cao, Fangzhou Hong, Tong Wu, Liang Pan, Ziwei Liu

TL;DR

This work introduces DiffTF, a diffusion-based framework for large-vocabulary 3D object generation that combines a revised triplane representation with a 3D-aware Transformer. A 3D-aware encoder/decoder and shared cross-plane attention let the model extract generalized 3D knowledge across planes and combine it with object-specific features to handle diverse shapes and textures. Experiments on ShapeNet and OmniObject3D report state-of-the-art performance on both 2D-rendered metrics and 3D geometry/texture metrics. Limitations include relatively slow triplane fitting at large scale and the potential for misuse in synthetic media; the authors acknowledge both alongside broader societal considerations.

Abstract

Creating diverse and high-quality 3D assets with an automatic generative model is highly desirable. Despite extensive efforts on 3D generation, most existing works focus on the generation of a single category or a few categories. In this paper, we introduce a diffusion-based feed-forward framework for synthesizing a large number of categories of real-world 3D objects with a single generative model. Notably, there are three major challenges for this large-vocabulary 3D generation: a) the need for an expressive yet efficient 3D representation; b) large diversity in geometry and texture across categories; c) complexity in the appearances of real-world objects. To this end, we propose a novel triplane-based 3D-aware Diffusion model with TransFormer, DiffTF, which addresses these challenges in three ways. 1) Considering efficiency and robustness, we adopt a revised triplane representation and improve the fitting speed and accuracy. 2) To handle the drastic variations in geometry and texture, we regard the features of all 3D objects as a combination of generalized 3D knowledge and specialized 3D features. To extract generalized 3D knowledge from diverse categories, we propose a novel 3D-aware transformer with shared cross-plane attention. It learns the relations across different planes and aggregates the generalized 3D knowledge with specialized 3D features. 3) In addition, we devise a 3D-aware encoder/decoder to enhance the generalized 3D knowledge in the encoded triplanes for handling categories with complex appearances. Extensive experiments on ShapeNet and OmniObject3D (over 200 diverse real-world categories) demonstrate that a single DiffTF model achieves state-of-the-art large-vocabulary 3D object generation performance with large diversity, rich semantics, and high quality.
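To make the triplane idea in the abstract concrete, here is a minimal sketch of how a generic triplane is usually queried: a 3D point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three feature vectors are aggregated. This is not the paper's revised triplane, whose details are not given here. The function names (`bilinear`, `query_triplane`, `constant_plane`), the `[H][W][C]` list layout, the `[-1, 1]` coordinate range, and aggregation by summation are all assumptions made for illustration.

```python
from typing import List, Sequence

# A feature plane stored as nested lists with layout [H][W][C] (assumed layout).
Plane = List[List[List[float]]]


def constant_plane(value: float, size: int, channels: int) -> Plane:
    """Hypothetical helper: build a size x size plane filled with one value."""
    return [[[value] * channels for _ in range(size)] for _ in range(size)]


def bilinear(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly sample a plane at (u, v), both assumed to lie in [-1, 1]."""
    h, w = len(plane), len(plane[0])
    # Map [-1, 1] to continuous pixel coordinates, then clamp to the grid.
    x = min(max((u + 1.0) * 0.5 * (w - 1), 0.0), w - 1.0)
    y = min(max((v + 1.0) * 0.5 * (h - 1), 0.0), h - 1.0)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = x - x0, y - y0
    out = []
    for c in range(len(plane[0][0])):
        top = plane[y0][x0][c] * (1 - tx) + plane[y0][x1][c] * tx
        bottom = plane[y1][x0][c] * (1 - tx) + plane[y1][x1][c] * tx
        out.append(top * (1 - ty) + bottom * ty)
    return out


def query_triplane(f_xy: Plane, f_xz: Plane, f_yz: Plane,
                   point: Sequence[float]) -> List[float]:
    """Project a 3D point onto the three planes and sum the sampled features.

    Summation is an assumption; concatenation is another common choice.
    """
    x, y, z = point
    feats = [bilinear(f_xy, x, y), bilinear(f_xz, x, z), bilinear(f_yz, y, z)]
    return [sum(channel) for channel in zip(*feats)]
```

For example, `query_triplane(constant_plane(1.0, 4, 2), constant_plane(2.0, 4, 2), constant_plane(3.0, 4, 2), (0.0, 0.0, 0.0))` returns a two-channel feature whose entries are the sum of the three plane values. In a full pipeline this feature would typically be passed to a small decoder network to predict density and color, which is outside the scope of this sketch.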

Paper Structure

This paper contains 22 sections, 8 equations, 16 figures, 4 tables.

Figures (16)

  • Figure 1: Visualization of large-vocabulary 3D object generation. DiffTF can generate high-quality 3D objects with rich semantic information and photo-realistic RGB. Top: Visualization of the generated results. Bottom: Interpolation between generated results.
  • Figure 2: An overview of DiffTF. It consists of two 3D-aware modules: a) The 3D-aware encoder/decoder aims to enhance the 3D relations in triplanes; b) The 3D-aware transformer concentrates on extracting global generalized 3D knowledge and specialized 3D features.
  • Figure 3: The detailed structure of our proposed 3D-aware modules. We take the feature from the xy plane $\hat{F}_{xy}$ as an example. Relying on the extracted generalizable 3D knowledge and specialized 3D features, our model can achieve impressive adaptivity among various categories.
  • Figure 4: Qualitative comparisons to the SOTA methods in terms of generated 2D images and 3D shapes on OmniObject3D. Compared with other SOTA methods, our generated results are more realistic with richer semantics.
  • Figure 5: Qualitative comparison of DiffTF against other SOTA methods on ShapeNet. It intuitively illustrates the promising performance of our method in texture and topology.
  • ...and 11 more figures
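
The captions for Figures 2 and 3 describe a 3D-aware transformer that relates features across planes, taking the xy plane as the example. The following is a rough sketch of that general idea, not the authors' module: tokens from one plane act as queries and attend to tokens gathered from the other two planes using plain scaled dot-product attention. The function names, the omission of learned query/key/value projections, and the single attention head are simplifying assumptions.

```python
import math
from typing import List

Token = List[float]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_plane_attention(query_tokens: List[Token],
                          other_tokens: List[Token]) -> List[Token]:
    """Single-head scaled dot-product attention from one plane to the others.

    query_tokens: tokens of the plane being updated (e.g. the xy plane).
    other_tokens: tokens from the remaining planes (e.g. xz and yz),
                  used as both keys and values. Learned projections are
                  omitted here (an assumption made to keep the sketch short).
    """
    d = len(query_tokens[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in query_tokens:
        # Similarity of this query to every key from the other planes.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in other_tokens]
        weights = softmax(scores)
        # Weighted average of the value tokens.
        out.append([sum(w * v[c] for w, v in zip(weights, other_tokens))
                    for c in range(d)])
    return out
```

A call such as `cross_plane_attention(xy_tokens, xz_tokens + yz_tokens)` returns one updated token per xy token, each a convex combination of the tokens from the other two planes. Sharing the same attention weights across all three planes, as the word "shared" in the abstract suggests, would correspond to reusing one set of projection parameters for each plane's update; that part is not modeled here.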