TAMM: TriAdapter Multi-Modal Learning for 3D Shape Understanding
Zhihao Zhang, Shengcao Cao, Yu-Xiong Wang
TL;DR
TAMM tackles the data-scarce regime in 3D shape understanding by combining multi-modal learning with a two-stage, adapter-based approach. It first tunes a CLIP Image Adapter to bridge the gap between rendered 2D images and natural images, then decouples 3D features into vision- and semantics-focused sub-spaces via an Image Alignment Adapter and a Text Alignment Adapter, aligning them with multi-view image and text features through a tri-modal objective. Across zero-shot, linear probing, few-shot, and real-world recognition, TAMM consistently outperforms prior CLIP-based methods, achieving state-of-the-art results on Objaverse-LVIS, ModelNet40, and ScanObjectNN, and transferring well to varied 3D encoders and datasets. The approach provides a practical path to scalable, generalizable 3D representations by leveraging abundant 2D and language data for cross-modal pre-training.
Abstract
The limited scale of current 3D shape datasets hinders advances in 3D shape understanding and motivates multi-modal learning approaches that transfer learned knowledge from the data-abundant 2D image and language modalities to 3D shapes. However, even though image and language representations have been aligned by cross-modal models like CLIP, we find that the image modality fails to contribute as much as the language modality in existing multi-modal 3D representation learning methods. We attribute this to the domain shift in the 2D images and the distinct focus of each modality. To leverage both modalities more effectively in pre-training, we introduce TriAdapter Multi-Modal Learning (TAMM), a novel two-stage learning approach based on three synergistic adapters. First, our CLIP Image Adapter mitigates the domain gap between 3D-rendered images and natural images by adapting the visual representations of CLIP for synthetic image-text pairs. Subsequently, our Dual Adapters decouple the 3D shape representation space into two complementary sub-spaces, one focusing on visual attributes and the other on semantic understanding, which ensures a more comprehensive and effective multi-modal pre-training. Extensive experiments demonstrate that TAMM consistently enhances 3D representations for a wide range of 3D encoder architectures, pre-training datasets, and downstream tasks. Notably, we boost the zero-shot classification accuracy on Objaverse-LVIS from 46.8% to 50.7%, and improve the 5-way 10-shot linear probing classification accuracy on ModelNet40 from 96.1% to 99.0%. Project page: https://alanzhangcs.github.io/tamm-page.
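As a rough illustration of the second stage described above, the following sketch shows how a single 3D shape feature might be projected into two sub-spaces and aligned with image and text features separately. It is a minimal sketch under stated assumptions, not the authors' implementation: the adapters are modeled as plain linear maps, the alignment loss is assumed to be a symmetric InfoNCE contrastive loss, and all names (`info_nce`, `tri_modal_loss`, `W_img`, `W_txt`) and the temperature value are hypothetical. The actual objective may include other terms or weightings.

```python
import math


def normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def linear(W, v):
    """Apply a weight matrix (a list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE over two batches of paired, unit-norm embeddings.

    a[i] and b[i] are treated as a positive pair; all other pairings in the
    batch act as negatives. The temperature is an assumed value.
    """
    n = len(a)
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # Mean cross-entropy with the diagonal entry as the target class.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
            total += log_sum_exp - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(transposed))


def tri_modal_loss(shape_feats, image_feats, text_feats, W_img, W_txt):
    """Align two adapted views of each 3D feature with image and text features.

    W_img stands in for an image-alignment adapter and W_txt for a
    text-alignment adapter; both are hypothetical single linear layers here.
    """
    s_img = [normalize(linear(W_img, s)) for s in shape_feats]  # vision sub-space
    s_txt = [normalize(linear(W_txt, s)) for s in shape_feats]  # semantic sub-space
    img = [normalize(v) for v in image_feats]
    txt = [normalize(v) for v in text_feats]
    return info_nce(s_img, img) + info_nce(s_txt, txt)
```

In this sketch the image and text features would come from a frozen CLIP model (with the image side passed through the adapter tuned in the first stage), and only the 3D encoder and the two alignment adapters would receive gradients. Keeping two separate projections lets the visual and semantic alignment terms shape different sub-spaces instead of competing for a single one.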
