DiffTF++: 3D-aware Diffusion Transformer for Large-Vocabulary 3D Generation

Ziang Cao; Fangzhou Hong; Tong Wu; Liang Pan; Ziwei Liu

TL;DR

This work tackles large-vocabulary 3D generation with DiffTF++, a diffusion-based feed-forward model that combines a 3D-aware transformer with a triplane representation to balance efficiency and generalization across diverse categories. The method is a two-stage pipeline (triplane fitting, then diffusion training) built on a 3D-aware encoder/decoder and a transformer with cross-plane attention. A 3D-aware refinement step and a multi-view reconstruction loss align the two stages and suppress artifacts. Experiments on ShapeNet and OmniObject3D show state-of-the-art texture, topology, and overall realism, with both quantitative metrics and human studies confirming the improvements. By integrating generalized 3D priors with category-specific details, the approach offers a scalable path toward high-quality, diverse 3D assets for games, robotics, and design.

Abstract

Automatically generating diverse and high-quality 3D assets is a fundamental yet challenging task in 3D computer vision. Despite extensive efforts in 3D generation, existing optimization-based approaches struggle to produce large-scale 3D assets efficiently. Meanwhile, feed-forward methods often focus on generating only a single category or a few categories, limiting their generalizability. Therefore, we introduce a diffusion-based feed-forward framework that addresses these challenges with a single model. To handle the large diversity and complexity in geometry and texture across categories efficiently, we 1) adopt an improved triplane representation to guarantee efficiency; 2) introduce the 3D-aware transformer to aggregate generalized 3D knowledge with specialized 3D features; and 3) devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge. Building upon our 3D-aware Diffusion model with TransFormer, DiffTF, we propose a stronger version for 3D generation, i.e., DiffTF++. It adds two components: a multi-view reconstruction loss and triplane refinement. Specifically, we use the multi-view reconstruction loss to fine-tune the diffusion model and triplane decoder, thereby avoiding the negative influence of reconstruction errors and improving texture synthesis. Eliminating the mismatch between the two stages enhances generative performance, especially in texture. Additionally, a 3D-aware refinement process filters out artifacts and refines the triplanes, producing more intricate and plausible details. Extensive experiments on ShapeNet and OmniObject3D demonstrate the effectiveness of our proposed modules and state-of-the-art 3D object generation performance with large diversity, rich semantics, and high quality.
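
As a rough illustration of why a triplane representation is efficient, the sketch below queries a feature for a 3D point from three 2D feature planes instead of a dense 3D grid. It shows a generic triplane lookup, not the improved variant used in the paper. The function names (`bilinear_sample`, `query_triplane`), the coordinate range [-1, 1], and the choice to sum the three plane features are all assumptions made for illustration.

```python
import numpy as np

def bilinear_sample(plane, u, v):
    """Bilinearly sample a (C, H, W) feature plane at (u, v) in [-1, 1].

    Assumes H >= 2 and W >= 2. u indexes the width axis, v the height axis.
    """
    _, H, W = plane.shape
    # Map [-1, 1] to continuous pixel coordinates [0, W-1] and [0, H-1].
    x = (u + 1.0) * 0.5 * (W - 1)
    y = (v + 1.0) * 0.5 * (H - 1)
    # Index of the top-left neighbour, clipped so the +1 neighbour stays in range.
    x0 = int(np.clip(np.floor(x), 0, W - 2))
    y0 = int(np.clip(np.floor(y), 0, H - 2))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    top = (1 - wx) * plane[:, y0, x0] + wx * plane[:, y0, x1]
    bottom = (1 - wx) * plane[:, y1, x0] + wx * plane[:, y1, x1]
    return (1 - wy) * top + wy * bottom

def query_triplane(planes, point):
    """Return a (C,) feature for a 3D point by projecting it onto three planes.

    planes: dict with keys "xy", "xz", "yz", each a (C, H, W) array.
    Summation is one common aggregation choice (an assumption here).
    """
    x, y, z = point
    f_xy = bilinear_sample(planes["xy"], x, y)
    f_xz = bilinear_sample(planes["xz"], x, z)
    f_yz = bilinear_sample(planes["yz"], y, z)
    return f_xy + f_xz + f_yz
```

In a full pipeline the returned feature would typically be passed to a small decoder that predicts density and color for rendering. Storage grows with the plane resolution squared rather than cubed, which is the efficiency argument for triplanes.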

Paper Structure (27 sections, 11 equations, 13 figures, 5 tables)

Introduction
Related Work
Transformer
3D Generation
Revisit Multi-Head Attention and DDPMs
Methodology
3D Representation Fitting
3D-aware Diffusion with transformer
3D-aware encoder/decoder
3D-aware Transformer
3D-aware Refinement and multi-view reconstruction loss
Experiments
Implementation details
Data processing
Two-stage Training
...and 12 more sections

Figures (13)

Figure 1: Comparison between DiffTF (top) and DiffTF++ (bottom). By introducing the multi-view reconstruction loss and 3D-aware refinement, DiffTF++ not only improves texture quality but also reduces artifacts in the generated 3D objects.
Figure 3: Pipeline comparison between DiffTF and DiffTF++. DiffTF (top) has two separate stages: 1) optimize the triplane features for each 3D object, and 2) train the 3D-aware diffusion model on those fitted triplanes. Because gradients do not flow between the two stages, there is an inevitable mismatch between the diffusion objective and the ground truth. DiffTF++ (bottom) addresses this mismatch by introducing a 3D-aware refinement and a multi-view reconstruction loss.
Figure 4: The detailed structure of our proposed 3D-aware encoder/decoder. The 3D-aware module encodes the triplanes efficiently while preserving 3D-related information through a single cross-plane attention module.
Figure 5: The detailed structure of our proposed 3D-aware transformer modules. We take the feature from the xy plane $\hat{F}_{xy}$ as an example. By combining the extracted generalizable 3D knowledge with specialized 3D features, our model adapts well across various categories.
Figure 6: Detailed comparison between DiffTF (without the multi-view reconstruction loss and 3D-aware refinement) and DiffTF++. The comparison shows that the proposed modules filter out noisy information and thereby effectively eliminate artifacts in the generated 3D objects. The refinement also enriches the details of the generated topology and yields high-quality 3D objects with rich texture.
...and 8 more figures
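
The cross-plane attention mentioned in the captions for Figures 4 and 5 can be sketched as follows, taking the xy plane as the example plane. This is a minimal single-head version written under assumptions: tokens from the xy plane act as queries, tokens from the other two planes are concatenated to form keys and values, and the projection matrices `w_q`, `w_k`, `w_v` are hypothetical placeholders. The paper's actual module may differ in head count, normalization, and token layout.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_plane_attention(f_xy, f_xz, f_yz, w_q, w_k, w_v):
    """Single-head attention from xy-plane tokens to the other two planes.

    f_xy, f_xz, f_yz: (N, C) token sequences, one per plane.
    w_q, w_k, w_v: (C, D) projection matrices (hypothetical parameters).
    Returns the attended features (N, D) and the attention map (N, 2N).
    """
    # Keys and values come from the two planes other than the query plane.
    context = np.concatenate([f_xz, f_yz], axis=0)  # (2N, C)
    q = f_xy @ w_q
    k = context @ w_k
    v = context @ w_v
    # Scaled dot-product attention.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = softmax(scores, axis=-1)
    return attn @ v, attn
```

A (C, H, W) feature plane can be flattened into the (N, C) token layout assumed here with `plane.reshape(C, -1).T`. Repeating the same operation with xz and then yz as the query plane would give every plane access to information from the other two, which is the intuition behind calling the module 3D-aware.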
