No Captions, No Problem: Captionless 3D-CLIP Alignment with Hard Negatives via CLIP Knowledge and LLMs

Cristian Sbrolli, Matteo Matteucci

TL;DR

The paper introduces two unsupervised methods, $I2I$ and $(I2L)^2$, which leverage CLIP knowledge about textual and 2D data to compute the neural perceived similarity between two 3D samples. These similarities are used to mine 3D hard negatives, establishing a multimodal contrastive pipeline with hard negative weighting via a custom loss function.

Abstract

In this study, we explore an alternative approach to enhance contrastive text-image-3D alignment in the absence of textual descriptions for 3D objects. We introduce two unsupervised methods, $I2I$ and $(I2L)^2$, which leverage CLIP knowledge about textual and 2D data to compute the neural perceived similarity between two 3D samples. We employ the proposed methods to mine 3D hard negatives, establishing a multimodal contrastive pipeline with hard negative weighting via a custom loss function. We train on different configurations of the proposed hard negative mining approach, and we evaluate the accuracy of our models in 3D classification and on the cross-modal retrieval benchmark, testing image-to-shape and shape-to-image retrieval. Results demonstrate that our approach, even without explicit text alignment, achieves comparable or superior performance on zero-shot and standard 3D classification, while significantly improving both image-to-shape and shape-to-image retrieval compared to previous methods.
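
The abstract does not spell out the custom loss, so the following is only a minimal sketch of what hard negative weighting in a contrastive objective can look like. Everything specific here is an assumption made for illustration: the function names, the weighting rule `1 + beta * max(sim, 0)`, the temperature value, and the use of plain cosine similarity between per-sample reference embeddings as a stand-in for the precomputed $I2I$ or $(I2L)^2$ scores.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    # Scale each row to unit length so dot products become cosine similarities.
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)

def pairwise_similarity(ref_embeddings):
    # Hypothetical stand-in for precomputed sample-to-sample similarity scores:
    # cosine similarity between one reference embedding per 3D sample.
    z = l2_normalize(ref_embeddings)
    return z @ z.T

def weighted_contrastive_loss(shape_emb, image_emb, sim, temperature=0.07, beta=1.0):
    # Shape-to-image InfoNCE in which each negative's term in the denominator
    # is scaled by an assumed weight w_ij = 1 + beta * max(sim_ij, 0).
    # With beta = 0 this reduces to the standard InfoNCE loss.
    s = l2_normalize(shape_emb)
    v = l2_normalize(image_emb)
    logits = (s @ v.T) / temperature

    weights = 1.0 + beta * np.clip(sim, 0.0, None)
    np.fill_diagonal(weights, 1.0)  # the positive pair is never re-weighted

    # Numerically stable log-sum-exp over the weighted terms.
    z = logits + np.log(weights)
    m = z.max(axis=1, keepdims=True)
    log_denominator = m[:, 0] + np.log(np.exp(z - m).sum(axis=1))
    return float((log_denominator - np.diag(logits)).mean())

rng = np.random.default_rng(0)
shape_emb = rng.normal(size=(4, 8))   # toy 3D encoder outputs
image_emb = rng.normal(size=(4, 8))   # toy image embeddings
sim = pairwise_similarity(rng.normal(size=(4, 8)))

base = weighted_contrastive_loss(shape_emb, image_emb, sim, beta=0.0)
hard = weighted_contrastive_loss(shape_emb, image_emb, sim, beta=2.0)
print(base, hard)
```

Because every weight is at least 1, negatives that the similarity scores mark as close to the anchor contribute more to the denominator, so the weighted loss is never smaller than the unweighted one for the same batch.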

Paper Structure (11 sections, 6 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Overview of our proposed approach. We first precompute 3D sample similarities through our proposed neural similarity metrics, then use the obtained scores to enhance the contrastive training with hard negatives.
  • Figure 2: Our proposed similarity metrics for 3D hard negative mining.
  • Figure 3: 3D-to-3D retrieval using our similarities on a chair sample from the ShapeNet dataset.
  • Figure 4: Example of 2D-to-3D cross-modal retrieval using our model. Images are from Caltech101; point clouds are from ModelNet40 and are ranked in order of cosine similarity.
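
The ranking step mentioned in the Figure 4 caption can be illustrated with a small sketch. It assumes the query image and the gallery of point clouds have already been embedded into a shared space; the function name `rank_by_cosine` and the toy 2D vectors are hypothetical and only serve the illustration.

```python
import numpy as np

def rank_by_cosine(query_emb, gallery_emb, top_k=5):
    # Normalize the query and every gallery row, then sort gallery items
    # by descending cosine similarity to the query.
    q = query_emb / np.linalg.norm(query_emb)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    scores = g @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

# Toy embeddings: one image query and three candidate shapes.
query = np.array([1.0, 0.1])
gallery = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])

order, scores = rank_by_cosine(query, gallery, top_k=3)
print(order.tolist())  # [0, 2, 1]
```

Shape-to-image retrieval works the same way with the roles swapped: a shape embedding becomes the query and the image embeddings form the gallery.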