Collaborative Multi-Modal Coding for High-Quality 3D Generation

Ziang Cao, Zhaoxi Chen, Liang Pan, Ziwei Liu

TL;DR

TriMM is the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point clouds). It employs a triplane latent diffusion model to generate high-quality 3D assets with improved texture and geometric detail.

Abstract

3D content is inherently multi-modal and can be projected into different modalities (e.g., RGB images, RGBD, and point clouds). Each modality has distinct advantages in 3D asset modeling: RGB images contain vivid 3D textures, whereas point clouds define fine-grained 3D geometries. However, most existing 3D-native generative architectures either operate predominantly within single-modality paradigms, thus overlooking the complementary benefits of multi-modality data, or restrict themselves to 3D structures, thereby limiting the scope of available training datasets. To holistically harness multi-modalities for 3D modeling, we present TriMM, the first feed-forward 3D-native generative model that learns from basic multi-modalities (e.g., RGB, RGBD, and point clouds). Specifically, 1) TriMM first introduces collaborative multi-modal coding, which integrates modality-specific features while preserving their unique representational strengths. 2) Furthermore, auxiliary 2D and 3D supervision is introduced to improve the robustness and performance of multi-modal coding. 3) Based on the embedded multi-modal code, TriMM employs a triplane latent diffusion model to generate 3D assets of superior quality, enhancing both the texture and the geometric detail. Extensive experiments on multiple well-known datasets demonstrate that TriMM, by effectively leveraging multi-modality, achieves performance competitive with models trained on large-scale datasets, despite using only a small amount of training data. Furthermore, we conduct additional experiments on recent RGB-D datasets, verifying the feasibility of incorporating other multi-modal datasets into 3D generation.
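For readers unfamiliar with the triplane representation mentioned in the abstract, the following is a minimal sketch of how a generic triplane is usually queried: a 3D point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three feature vectors are combined. This is not taken from the paper. The function names (`bilinear_sample`, `query_triplane`), the `[H][W][C]` nested-list layout, the `[-1, 1]` coordinate range, and summation as the aggregation step are all assumptions made for illustration.

```python
from typing import List, Sequence

Plane = List[List[List[float]]]  # assumed layout: plane[row][col][channel]


def bilinear_sample(plane: Plane, u: float, v: float) -> List[float]:
    """Bilinearly sample a feature plane at (u, v), both assumed in [-1, 1]."""
    h, w = len(plane), len(plane[0])
    # Clamp to the valid range, then map [-1, 1] to pixel coordinates.
    u = min(max(u, -1.0), 1.0)
    v = min(max(v, -1.0), 1.0)
    x = (u + 1.0) * 0.5 * (w - 1)
    y = (v + 1.0) * 0.5 * (h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    tx, ty = x - x0, y - y0
    out = []
    for c in range(len(plane[0][0])):
        top = plane[y0][x0][c] * (1 - tx) + plane[y0][x1][c] * tx
        bottom = plane[y1][x0][c] * (1 - tx) + plane[y1][x1][c] * tx
        out.append(top * (1 - ty) + bottom * ty)
    return out


def query_triplane(xy: Plane, xz: Plane, yz: Plane,
                   point: Sequence[float]) -> List[float]:
    """Project a 3D point onto the three planes and sum the sampled features.

    Summation is an assumption; concatenation or averaging are also common.
    """
    x, y, z = point
    f_xy = bilinear_sample(xy, x, y)
    f_xz = bilinear_sample(xz, x, z)
    f_yz = bilinear_sample(yz, y, z)
    return [a + b + c for a, b, c in zip(f_xy, f_xz, f_yz)]


if __name__ == "__main__":
    # Toy 2x2 single-channel plane reused for all three orientations.
    toy = [[[0.0], [1.0]], [[2.0], [3.0]]]
    print(query_triplane(toy, toy, toy, (0.0, 0.0, 0.0)))
```

In a real pipeline the resulting feature vector would be passed to a decoder that predicts geometry and appearance; that step is omitted here because the page does not describe it in enough detail.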

Paper Structure

This paper contains 27 sections, 10 equations, 14 figures, 8 tables.

Figures (14)

  • Figure 1: By leveraging a) Collaborative Multi-Modal Coding encoded from photometric (RGB, RGBD) and geometric (RGBD, Point Clouds) information, b) TriMM can create high-quality textured meshes within 4 seconds from a single image.
  • Figure 2: Overview of TriMM. To extract the unique attributes of the multi-modal triplanes while avoiding their specific weaknesses, we introduce loss_2, i.e., a reconstruction loss, during training. It guides the generative model to leverage the strengths of multi-modal coding, thereby achieving promising performance in 3D modeling.
  • Figure 3: Detailed structure of our Collaborative Multi-Modal Coding. The coding can be tokenized from each of the modalities (i.e., RGB, RGBD, and point clouds) using different encoders, shown as the three branches above. By adopting a shared-weight triplane-flexicube decoder, the codings (i.e., the corresponding triplanes) from different modalities collaboratively share a joint latent space.
  • Figure 4: Training pipeline of our triplane latent diffusion model. It harnesses and integrates the distinctive attributes of the various modalities via a reconstruction loss, thereby producing 3D assets with rich texture and finely detailed structures.
  • Figure 5: Similarity scores between the multi-modality triplanes and the generated triplane. The results indicate that the original multi-modality triplanes exhibit limited interdependence, whereas our generated triplane effectively integrates and leverages information from all modalities.
  • ...and 9 more figures
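
The Figure 5 caption refers to similarity scores between triplanes without stating the metric on this page. As a hedged illustration only, the sketch below computes cosine similarity between two flattened triplanes. The choice of cosine similarity, the nested-list layout, and the helper names `flatten` and `cosine_similarity` are assumptions, not the paper's stated procedure.

```python
import math
from typing import List


def flatten(tensor) -> List[float]:
    """Flatten an arbitrarily nested list of numbers into a flat list."""
    if isinstance(tensor, (int, float)):
        return [float(tensor)]
    flat: List[float] = []
    for item in tensor:
        flat.extend(flatten(item))
    return flat


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two same-shaped nested lists (assumed metric)."""
    va, vb = flatten(a), flatten(b)
    if len(va) != len(vb):
        raise ValueError("inputs must contain the same number of elements")
    dot = sum(x * y for x, y in zip(va, vb))
    norm_a = math.sqrt(sum(x * x for x in va))
    norm_b = math.sqrt(sum(y * y for y in vb))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero inputs
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Hypothetical toy "triplanes": three 1x2 single-channel planes each.
    rgb_triplane = [[[[1.0], [0.0]]], [[[0.0], [1.0]]], [[[1.0], [1.0]]]]
    generated = [[[[1.0], [0.0]]], [[[0.0], [1.0]]], [[[1.0], [1.0]]]]
    print(cosine_similarity(rgb_triplane, generated))
```

Under this assumed metric, a low score between two modality-specific triplanes would correspond to the "limited interdependence" the caption describes, and a higher score against the generated triplane to shared information.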