Optimizing Multi-Modal Models for Image-Based Shape Retrieval: The Role of Pre-Alignment and Hard Contrastive Learning

Paul Julius Kühn, Cedric Spengler, Michael Weinmann, Arjan Kuijper, Saptarshi Neil Sinha

TL;DR

This work proposes the use of pre-aligned image and shape encoders for zero-shot and standard IBSR by embedding images and point clouds into a shared representation space and performing retrieval via similarity search over compact single-embedding shape descriptors.

Abstract

Image-based shape retrieval (IBSR) aims to retrieve 3D models from a database given a query image, hence addressing a classical task in computer vision, computer graphics, and robotics. Recent approaches typically rely on bridging the domain gap between 2D images and 3D shapes based on the use of multi-view renderings as well as task-specific metric learning to embed shapes and images into a common latent space. In contrast, we address IBSR through large-scale multi-modal pretraining and show that explicit view-based supervision is not required. Inspired by pre-aligned image and point-cloud encoders from ULIP and OpenShape that have been used for tasks such as 3D shape classification, we propose the use of pre-aligned image and shape encoders for zero-shot and standard IBSR by embedding images and point clouds into a shared representation space and performing retrieval via similarity search over compact single-embedding shape descriptors. This formulation allows skipping view synthesis and naturally enables zero-shot and cross-domain retrieval without retraining on the target database. We evaluate pre-aligned encoders in both zero-shot and supervised IBSR settings and additionally introduce a multi-modal hard contrastive loss (HCL) to further increase retrieval performance. Our evaluation demonstrates state-of-the-art performance, outperforming related methods on $Acc_{Top1}$ and $Acc_{Top10}$ for shape retrieval across multiple datasets, with best results observed for OpenShape combined with Point-BERT. Furthermore, training with our proposed multi-modal HCL yields dataset-dependent gains in standard instance retrieval tasks on shape-centric data, underscoring the value of pretraining and hard contrastive learning for 3D shape retrieval. The code will be made available via the project website.
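
The retrieval step described in the abstract (similarity search over one embedding per shape, scored by Top-k accuracy) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names `l2_normalize`, `retrieve`, and `top_k_accuracy` are hypothetical, cosine similarity is assumed as the similarity measure, and the tiny 3-D vectors are made-up stand-ins for the outputs of the image and point-cloud encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def retrieve(query_emb, shape_embs, k):
    """Return the indices of the k shapes most similar to the query image embedding."""
    q = l2_normalize(query_emb)
    scores = [sum(a * b for a, b in zip(q, l2_normalize(s))) for s in shape_embs]
    ranked = sorted(range(len(shape_embs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

def top_k_accuracy(query_embs, shape_embs, gt_indices, k):
    """Fraction of queries whose ground-truth shape is among the top-k results."""
    hits = sum(gt in retrieve(q, shape_embs, k)
               for q, gt in zip(query_embs, gt_indices))
    return hits / len(query_embs)

# Made-up embeddings: three database shapes and three query images.
shapes = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9], [0.6, 0.7, 0.0]]
ground_truth = [0, 2, 0]  # index of the correct shape for each query

acc_top1 = top_k_accuracy(queries, shapes, ground_truth, k=1)
acc_top2 = top_k_accuracy(queries, shapes, ground_truth, k=2)
```

In this toy setup the third query is closest to the wrong shape, so it counts as a miss at k=1 but as a hit at k=2. Because each shape is represented by a single vector, the database embeddings can be computed once and reused for every query.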

Paper Structure

This paper contains 17 sections, 4 equations, 5 figures, and 5 tables.

Figures (5)

  • Figure 1: Training pipeline with multiple modalities. Zero-shot retrieval (a) uses pre-aligned image and shape encoders trained with multi-modal contrastive losses [xue2023ulip, ulip2, liu2023openshape]. Training or fine-tuning (b) of image and shape encoders using either InfoNCE or hard contrastive learning enables standard retrieval (c).
  • Figure 2: Comparison of (a) random sampling vs. (b) hard negative sampling [robinson2021hcl]. Random sampling may yield out-of-class samples (e.g., couch $\bigtriangleup$ vs. airplane $\bigcirc$), producing overly easy negatives or even false negatives (a), while hard negative sampling ensures true hard negatives close to the anchor $\pmb{\bigcirc}$ (b) (inspired by Fig. 1 of [robinson2021hcl]).
  • Figure 3: Von Mises-Fisher distribution on a unit hypersphere for varying $\beta$. Higher $\beta$ increases concentration of $q_\beta$ and negative hardness.
  • Figure 4: Comparison of pre-training and loss function effects on instance retrieval.
  • Figure 5: Qualitative results for a given query image ($I_g$) of the IKEA_EKTORP_2 sofa from Pix3D [sun2018pix3d], comparing the shapes retrieved by the scaled Point-BERT [yu2021pointbert] model from OpenShape [liu2023openshape] in its pre-aligned version (a), after fine-tuning on Pix3D [sun2018pix3d] using the InfoNCE loss [oord2018infonce] (b), and after fine-tuning using our multi-modal HCL loss (c).
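
The captions of Figures 2 and 3 refer to hard negative sampling with a concentration parameter $\beta$. As a rough illustration only, the sketch below follows the importance-weighting idea from the hard negative sampling work cited there (robinson2021hcl), without its debiasing term; it is not the paper's exact multi-modal HCL, whose formulation is not given in this excerpt. The function name `hard_contrastive_loss`, the default temperature, and the toy similarity values are assumptions.

```python
import math

def hard_contrastive_loss(pos_sim, neg_sims, temperature=0.1, beta=1.0):
    """Contrastive loss for one anchor with hardness-weighted negatives.

    pos_sim  : cosine similarity between the anchor and its positive
    neg_sims : cosine similarities between the anchor and each negative
    beta     : concentration; beta = 0 recovers the plain InfoNCE loss
    """
    pos = math.exp(pos_sim / temperature)
    neg = [math.exp(s / temperature) for s in neg_sims]
    # Importance weights exp(beta * s / temperature): negatives that lie
    # closer to the anchor (harder negatives) receive larger weights.
    imp = [n ** beta for n in neg]
    mean_imp = sum(imp) / len(imp)
    # Reweighted negative term; equals sum(neg) when all weights are equal.
    reweighted_neg = sum(w * n for w, n in zip(imp, neg)) / mean_imp
    return -math.log(pos / (pos + reweighted_neg))

# Toy similarities (assumed): one positive, one hard and two easy negatives.
pos_sim = 0.8
neg_sims = [0.7, 0.1, -0.3]

infonce_loss = hard_contrastive_loss(pos_sim, neg_sims, beta=0.0)
hcl_loss = hard_contrastive_loss(pos_sim, neg_sims, beta=1.0)
```

With `beta=0.0` all weights are equal and the expression reduces to InfoNCE. Raising `beta` shifts weight toward the negative with similarity 0.7, so the same batch yields a larger loss and the gradient focuses on the negatives that are hardest to separate from the anchor.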