MoCA: Mixture-of-Components Attention for Scalable Compositional 3D Generation

Zhiqi Li, Wenhuan Li, Tengfei Wang, Zhenwei Wang, Junta Wu, Haoyuan Wang, Yunhan Yang, Zehuan Huang, Yang Li, Peidong Liu, Chunchao Guo

TL;DR

MoCA addresses the scalability bottleneck of global attention in compositional 3D generation with Mixture-of-Components Attention, which rests on two core designs: importance-based routing, where each component attends to the full tokens of a top-k subset of components, and compression of the remaining components to preserve coarse context. Built on vecset diffusion models, MoCA adds learnable tokens and per-component latents to enable efficient inter-component communication through interleaved local and global blocks, supporting up to 32 components per training sample. The approach outperforms baselines on both part-aware 3D object generation and instance-based scene generation, with ablations validating the routing and compression mechanisms. This work advances scalable, native compositional 3D generation, with practical benefits for fine-grained asset creation and editing workflows.

Abstract

Compositionality is critical for 3D object and scene generation, but existing part-aware 3D generation methods scale poorly because the cost of global attention grows quadratically as the number of components increases. In this work, we present MoCA, a compositional 3D generative model with two key designs: (1) importance-based component routing, which selects the top-k relevant components for sparse global attention, and (2) compression of unimportant components, which preserves the contextual priors of unselected components while reducing the computational complexity of global attention. With these designs, MoCA enables efficient, fine-grained compositional 3D asset creation with a scalable number of components. Extensive experiments show that MoCA outperforms baselines on both compositional object and scene generation tasks. Project page: https://lizhiqi49.github.io/MoCA
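
The two designs named in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation. It assumes that each component already has a set of full tokens and a shorter set of compressed tokens, and it uses a hypothetical importance score (dot-product similarity of mean-pooled compressed tokens), since the scoring function is not specified here. The names `attend` and `moca_attention_sketch` are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv):
    """Single-head scaled dot-product attention, using kv as both keys and values."""
    scores = q @ kv.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ kv

def moca_attention_sketch(full, compressed, top_k):
    """full: (C, N, D) full tokens; compressed: (C, M, D) tokens with M < N.

    Each component attends to its own full tokens, the full tokens of its
    top_k most important other components, and the compressed tokens of the rest.
    """
    C, N, D = full.shape
    out = np.empty_like(full)
    # ASSUMPTION: importance = similarity of mean-pooled compressed tokens.
    summary = compressed.mean(axis=1)            # (C, D)
    importance = summary @ summary.T             # (C, C)
    np.fill_diagonal(importance, -np.inf)        # self is always included separately
    k = min(top_k, C - 1)
    for i in range(C):
        selected = np.argsort(importance[i])[::-1][:k]
        rest = [j for j in range(C) if j != i and j not in selected]
        context = [full[i]]
        context += [full[j] for j in selected]       # routed: full tokens
        context += [compressed[j] for j in rest]     # unselected: compressed tokens
        out[i] = attend(full[i], np.concatenate(context, axis=0))
    return out
```

Under these assumptions, each component's context holds N + kN + (C - 1 - k)M tokens instead of the CN tokens of full global attention, which is where the saving comes from once k is fixed and M is much smaller than N. Setting `top_k = C - 1` recovers full global attention.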

Paper Structure

This paper contains 44 sections, 5 equations, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Overview of MoCA. Our DiT model starts by packing each component's latents using several learnable queries through a cross-attention layer. Random ID embeddings are applied to distinguish different components. Then, each component's full latents and their compressed version are fed into our DiT model, which is composed of interleaved local attention blocks and our proposed Mixture-of-Components Attention blocks. Finally, the clean latents of all components are separately decoded to the global space by a frozen shape decoder to form the final 3D asset.
  • Figure 2: Illustration of Mixture-of-Components Attention. The calculation stream for component $\mathbf{c}_i$ is highlighted. This procedure is permutation-invariant across all components.
  • Figure 3: Qualitative comparison for part-composed object generation. PartPacker cannot control the number of generated parts and tends to generate coarse-grained decompositions. PartCrafter suffers from poor surface quality and large-area floaters on complex compositions. We run PartCrafter with the same part number configuration as ours.
  • Figure 4: Qualitative results on real-world images.
  • Figure 5: Complex scene ($>$16 instances) generation. MoCA can generate complex scenes from a single image, a capability that previous scene generation methods lack.
  • ...and 4 more figures