TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding

Zhihao Zhang, Shengcao Cao, Yu-Xiong Wang

TL;DR

TAMM tackles the data-scarce regime in 3D shape understanding with a two-stage, adapter-based multi-modal learning approach. It first tunes a CLIP Image Adapter to bridge the gap between rendered 2D images and natural images, then decouples 3D features into vision- and semantics-focused sub-spaces via an Image Alignment Adapter and a Text Alignment Adapter, aligning them with multi-view image and text features through a tri-modal objective. Across zero-shot, linear probing, few-shot, and real-world recognition, TAMM consistently outperforms prior CLIP-based methods, achieving state-of-the-art results on Objaverse-LVIS, ModelNet40, and ScanObjectNN, and transferring well to varied 3D encoders and datasets. The approach offers a practical path to scalable, generalizable 3D representations by leveraging abundant 2D and language data for cross-modal pre-training.

Abstract

The limited scale of current 3D shape datasets hinders advances in 3D shape understanding, and motivates multi-modal learning approaches which transfer learned knowledge from data-abundant 2D image and language modalities to 3D shapes. However, even though the image and language representations have been aligned by cross-modal models like CLIP, we find that the image modality fails to contribute as much as the language modality in existing multi-modal 3D representation learning methods. This is attributed to the domain shift in the 2D images and the distinct focus of each modality. To more effectively leverage both modalities in pre-training, we introduce TriAdapter Multi-Modal Learning (TAMM), a novel two-stage learning approach based on three synergistic adapters. First, our CLIP Image Adapter mitigates the domain gap between 3D-rendered images and natural images by adapting the visual representations of CLIP for synthetic image-text pairs. Subsequently, our Dual Adapters decouple the 3D shape representation space into two complementary sub-spaces, one focusing on visual attributes and the other on semantic understanding, which ensures a more comprehensive and effective multi-modal pre-training. Extensive experiments demonstrate that TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks. Notably, we boost the zero-shot classification accuracy on Objaverse-LVIS from 46.8% to 50.7%, and improve the 5-way 10-shot linear probing classification accuracy on ModelNet40 from 96.1% to 99.0%. Project page: https://alanzhangcs.github.io/tamm-page.
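
To make the decoupling idea in the abstract more concrete, here is a minimal sketch, not the authors' implementation. It assumes each of the two adapters is a single linear map applied to a shared 3D shape feature, and that zero-shot class scores are obtained by summing the cosine similarities of both adapted features with each class text feature. The names `decouple`, `zero_shot_scores`, `w_image`, and `w_text`, the linear form of the adapters, and the summation rule are all illustrative assumptions.

```python
import math

def normalize(v):
    """Scale a vector to unit length (assumes v is not all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def linear(weights, x):
    """Apply a weight matrix (list of rows) to a vector."""
    return [dot(row, x) for row in weights]

def decouple(shape_feat, w_image, w_text):
    """Project one 3D shape feature into two sub-spaces.

    Assumption: each adapter is a plain linear map. One output is meant
    to be aligned with image features, the other with text features.
    """
    image_aligned = normalize(linear(w_image, shape_feat))
    text_aligned = normalize(linear(w_text, shape_feat))
    return image_aligned, text_aligned

def zero_shot_scores(shape_feat, w_image, w_text, class_text_feats):
    """Score each class by combining both sub-space features.

    Assumption: the two cosine similarities are simply summed.
    """
    image_aligned, text_aligned = decouple(shape_feat, w_image, w_text)
    scores = []
    for class_feat in class_text_feats:
        c = normalize(class_feat)
        scores.append(dot(image_aligned, c) + dot(text_aligned, c))
    return scores
```

In this sketch the predicted class would be the index of the largest score. In a trained model the two weight matrices would be learned during pre-training rather than supplied by hand.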

Paper Structure (16 sections, 6 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: Our TriAdapter Multi-Modal Learning (TAMM) significantly enhances 3D shape understanding. Left: When aligning features of 3D point clouds (P) with 2D images (I) and/or text (T), prior methods (e.g., ULIP [xue2023ulip] and OpenShape [liu2023openshape]) under-utilize the image modality, due to the overlooked or unsolved image domain gap. TAMM better exploits the image modality and brings more gains when learning from both image and text data. The results are produced by pre-training Point-BERT [yu2022point] on ShapeNet [chang2015shapenet]. Middle: Our CLIP Image Adapter (CIA) re-aligns the images rendered from 3D shapes with the text descriptions. The rendered images are inaccurately matched with text when the image features are directly extracted by CLIP, and CIA can correct the matching. Right: Our Image Alignment Adapter (IAA) and Text Alignment Adapter (TAA) decouple 3D features with complementary visual and semantic focuses. In the visualized examples, features from one single adapter are matched with classes whose appearance or semantics resemble the true class; using both adapters leads to the correct class.
  • Figure 2: Overview of TAMM. Left: In Stage 1, TAMM fine-tunes a lightweight CLIP Image Adapter (CIA) through contrastive learning and re-aligns the image features with the text features to alleviate the domain shift originating from rendered images. Contrastive learning maximizes inner products between features from corresponding text-image pairs, and reduces similarities of mismatched pairs. Middle: In Stage 2, TAMM introduces the Image Alignment Adapter (IAA) and Text Alignment Adapter (TAA) to decouple 3D representations into two sub-spaces: one focusing more on visual attributes and the other on semantic understanding, ensuring a more comprehensive and effective multi-modal pre-training strategy. Right: TAMM adaptively utilizes decoupled 3D features for various downstream tasks including linear probing classification (top) and zero-shot classification (bottom), achieving more robust classification results.
  • Figure 3: Qualitative results. Top: CIA re-aligns the images rendered from 3D shapes with the text descriptions. The rendered images are inaccurately matched with text when the image features are directly extracted by CLIP, and CIA can correct the matching. Bottom: IAA and TAA decouple 3D features with complementary visual and semantic focuses. Features from one single adapter are matched with classes whose appearance or semantics resemble the true class; using both adapters leads to the correct class.
  • Figure 4: Qualitative results of text-to-point-cloud retrieval. We use TAMM to acquire the features of the given query text and retrieve the point clouds with the most similar features. The shown examples demonstrate TAMM's strong multi-modal comprehension.
  • Figure 5: Qualitative results of image-to-point-cloud retrieval. We use TAMM to acquire the features of the given query images and retrieve the point clouds with the most similar features. The shown examples demonstrate TAMM's strong multi-modal comprehension.
  • ...and 2 more figures
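
The captions above describe two mechanisms that a short sketch may help illustrate: the contrastive learning of Figure 2 (raise similarity for matching pairs, lower it for mismatched ones) and the feature-similarity retrieval of Figures 4 and 5. The code below is a dependency-free illustration under stated assumptions, not the paper's implementation: it uses a symmetric CLIP-style cross-entropy over cosine similarities, a temperature of 0.07, and the hypothetical function names `contrastive_loss` and `retrieve`.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so inner products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def similarity_matrix(a_feats, b_feats):
    """Cosine similarity between every vector in a_feats and every vector in b_feats."""
    a = [l2_normalize(v) for v in a_feats]
    b = [l2_normalize(v) for v in b_feats]
    return [[sum(x * y for x, y in zip(u, w)) for w in b] for u in a]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(a_feats, b_feats, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired features.

    Pair i consists of a_feats[i] and b_feats[i]; all other combinations
    in the batch act as mismatched pairs. The temperature is an assumed value.
    """
    sims = similarity_matrix(a_feats, b_feats)
    logits = [[s / temperature for s in row] for row in sims]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(transposed))

def retrieve(query_feat, gallery_feats, k=1):
    """Return indices of the k gallery features most similar to the query."""
    sims = similarity_matrix([query_feat], gallery_feats)[0]
    ranked = sorted(range(len(sims)), key=lambda i: sims[i], reverse=True)
    return ranked[:k]
```

With this sketch, a batch whose pairs are correctly aligned yields a lower loss than the same batch with its pairs shuffled, which is the behavior the Figure 2 caption describes. In `retrieve`, the query feature could come from a text or image encoder and the gallery from a 3D encoder, provided all features live in the same embedding space.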