ULIP-2: Towards Scalable Multimodal Pre-training for 3D Understanding

Le Xue, Ning Yu, Shu Zhang, Artemis Panagopoulou, Junnan Li, Roberto Martín-Martín, Jiajun Wu, Caiming Xiong, Ran Xu, Juan Carlos Niebles, Silvio Savarese

TL;DR

ULIP-2 tackles the scalability bottleneck of multimodal 3D learning by automatically generating holistic language descriptions from rendered views of 3D shapes using large multimodal models, enabling tri-modal pre-training without manual annotations. It aligns 3D point clouds, 2D images, and language within a frozen OpenCLIP space via dual contrastive losses, and scales both vision-language and 3D backbones to improve performance. Empirical results on Objaverse-LVIS, ModelNet40, and ScanObjectNN demonstrate state-of-the-art zero-shot and strong fine-tuned performance, along with notable gains in 3D-to-language captioning. The work also releases ULIP-Objaverse and ULIP-ShapeNet triplets, illustrating a practical path to scalable, annotation-free multimodal 3D representation learning.
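The dual contrastive objective mentioned above can be illustrated with a minimal sketch. It assumes a symmetric InfoNCE-style loss for each of the two pairs (3D-to-image and 3D-to-text), equal weighting of the two terms, and a temperature of 0.07; none of these specifics is stated in this summary, and the function names are hypothetical. Image and text features stand in for outputs of the frozen vision-language encoders, so only the 3D features would receive gradients in real training.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.07):
    """One-directional InfoNCE: queries[i] should match keys[i] within the batch."""
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_contrastive(a, b, temperature=0.07):
    """Average of the a-to-b and b-to-a InfoNCE terms."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def dual_contrastive_loss(point_feats, image_feats, text_feats, temperature=0.07):
    """Sum of the 3D-image and 3D-text terms (equal weights are an assumption).

    image_feats and text_feats are treated as fixed targets from a frozen
    vision-language model; point_feats come from the trainable 3D encoder.
    """
    return (symmetric_contrastive(point_feats, image_feats, temperature)
            + symmetric_contrastive(point_feats, text_feats, temperature))
```

With toy features, a batch whose 3D features line up with their paired image and text features yields a lower loss than the same batch with the pairs shuffled, which is the behavior the alignment objective relies on.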

Abstract

Recent advancements in multimodal pre-training have shown promising efficacy in 3D representation learning by aligning multimodal features across 3D shapes, their 2D counterparts, and language descriptions. However, the methods used by existing frameworks to curate such multimodal data, in particular language descriptions for 3D shapes, are not scalable, and the collected language descriptions are not diverse. To address this, we introduce ULIP-2, a simple yet effective tri-modal pre-training framework that leverages large multimodal models to automatically generate holistic language descriptions for 3D shapes. It only needs 3D data as input, eliminating the need for any manual 3D annotations, and is therefore scalable to large datasets. ULIP-2 is also equipped with scaled-up backbones for better multimodal representation learning. We conduct experiments on two large-scale 3D datasets, Objaverse and ShapeNet, and augment them with tri-modal datasets of 3D point clouds, images, and language for training ULIP-2. Experiments show that ULIP-2 demonstrates substantial benefits in three downstream tasks: zero-shot 3D classification, standard 3D classification with fine-tuning, and 3D captioning (3D-to-language generation). It achieves a new SOTA of 50.6% (top-1) on Objaverse-LVIS and 84.7% (top-1) on ModelNet40 in zero-shot classification. In the ScanObjectNN benchmark for standard fine-tuning, ULIP-2 reaches an overall accuracy of 91.5% with a compact model of only 1.4 million parameters. ULIP-2 sheds light on a new paradigm for scalable multimodal 3D representation learning without human annotations and shows significant improvements over existing baselines. The code and datasets are released at https://github.com/salesforce/ULIP.
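Zero-shot 3D classification in an aligned feature space can be sketched as follows. This is an illustrative sketch rather than the released implementation: it assumes that each class name has already been embedded by the text encoder, that each shape has been embedded by the pre-trained 3D encoder, and that the prediction is the class whose text feature has the highest cosine similarity to the shape feature. The function names and toy vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_classify(shape_feat, class_text_feats):
    """Return the class name whose text feature is closest to the 3D feature."""
    return max(class_text_feats,
               key=lambda name: cosine(shape_feat, class_text_feats[name]))


def top1_accuracy(shape_feats, labels, class_text_feats):
    """Fraction of shapes whose nearest class text feature matches the label."""
    correct = sum(
        zero_shot_classify(feat, class_text_feats) == label
        for feat, label in zip(shape_feats, labels)
    )
    return correct / len(labels)


# Toy stand-ins for encoder outputs (assumed values, not real features).
class_text_feats = {
    "chair": [1.0, 0.0, 0.0],
    "lamp": [0.0, 1.0, 0.0],
    "table": [0.0, 0.0, 1.0],
}
prediction = zero_shot_classify([0.9, 0.1, 0.0], class_text_feats)
```

No classifier head is trained here; the top-1 numbers reported for Objaverse-LVIS and ModelNet40 are accuracies of this kind, computed with the real encoders instead of toy vectors.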

Paper Structure (23 sections, 3 equations, 3 figures, 11 tables)

Figures (3)

  • Figure 1: Overview of the ULIP-2 pre-training framework and its downstream tasks. The upper part shows the pre-training framework: ULIP-2 employs a large multimodal model to automatically generate detailed descriptions for each 2D image rendered from a holistic set of viewpoints of a 3D shape. It then uses a pre-aligned, frozen vision-language feature space to align the three modalities: holistic texts, images, and 3D point clouds. After pre-training, the 3D encoder is used in the downstream tasks. As the figure shows, the pre-training process requires only 3D data.
  • Figure 2: An illustration of language description generation from 2D images. These images are rendered from a set of holistic viewpoints of a 3D object. In some views the chair is not visible, while in others the scepter/sword cannot be seen. Combining the descriptions of all views is essential for the model to learn comprehensive, holistic information about the 3D object. The manual caption in this object's metadata is “Estatua de Alfonso X - José Alcoverro (1892)”, which carries little semantic information and could harm multimodal pre-training, unlike ULIP-2's holistic captions.
  • Figure 3: 3D-to-language multimodal generation using the X-InstructBLIP framework.