Zero in on Shape: A Generic 2D-3D Instance Similarity Metric learned from Synthetic Data

Maciej Janik, Niklas Gard, Anna Hilsmann, Peter Eisert

TL;DR

The paper tackles cross-modal instance retrieval by learning a generic shape similarity metric between RGB images and untextured 3D models. It introduces a siamese network that aggregates multiple rendered views of a 3D model into a single shape embedding and aligns that embedding with image embeddings using a cosine-distance contrastive loss, trained exclusively on synthetic data generated with Domain Randomization. The authors show that increasing synthetic-data diversity and sharing weights between the siamese branches both improve zero-shot retrieval, with Top-5 performance nearly matching the instance-aware baseline. The work demonstrates practical zero-shot 2D-3D retrieval and offers guidance on designing synthetic data to bridge the domain gap. The approach is promising for scalable retrieval in settings where new shapes appear frequently.
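The training objective summarised above can be illustrated with a small sketch. This is not the authors' implementation: the element-wise max pooling over views, the squared form of the loss, and the margin value of 0.5 are all assumptions made for illustration, and the function names are hypothetical.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cos(a, b) for two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def pool_views(view_embeddings):
    """Fuse per-view embeddings into one shape embedding.

    Element-wise max is an assumed pooling choice, not confirmed by the source.
    """
    return [max(column) for column in zip(*view_embeddings)]

def contrastive_loss(image_emb, shape_emb, is_match, margin=0.5):
    """Generic contrastive loss on cosine distance (margin is hypothetical).

    Matching pairs are pulled together; non-matching pairs are pushed
    apart until their distance reaches the margin.
    """
    d = cosine_distance(image_emb, shape_emb)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2

# Two rendered views of one 3D model, each already encoded as a vector.
views = [[1.0, 0.0, 0.2], [0.5, 0.0, 0.9]]
shape = pool_views(views)                      # [1.0, 0.0, 0.9]
image = [1.0, 0.0, 0.9]                        # embedding of a matching photo
loss_match = contrastive_loss(image, shape, is_match=True)
loss_far = contrastive_loss([1.0, 0.0], [0.0, 1.0], is_match=False)
```

In this toy case the matching pair has (near) zero loss because the embeddings coincide, and the orthogonal non-matching pair also has zero loss because its distance already exceeds the margin.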

Abstract

We present a network architecture which compares RGB images and untextured 3D models by the similarity of the represented shape. Our system is optimised for zero-shot retrieval, meaning it can recognise shapes never shown in training. We use a view-based shape descriptor and a siamese network to learn object geometry from pairs of 3D models and 2D images. Due to scarcity of datasets with exact photograph-mesh correspondences, we train our network with only synthetic data. Our experiments investigate the effect of different qualities and quantities of training data on retrieval accuracy and present insights from bridging the domain gap. We show that increasing the variety of synthetic data improves retrieval accuracy and that our system's performance in zero-shot mode can match that of the instance-aware mode, as far as narrowing down the search to the top 10% of objects.
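The idea of narrowing the search to the top 10% of objects can be sketched as a ranking step over precomputed embeddings. This is a minimal illustration under assumptions: the gallery contents, the embedding values, and the helper names are hypothetical, and the paper's actual evaluation protocol may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_gallery(query_emb, gallery):
    """Return model ids sorted from most to least similar to the query.

    gallery maps a model id to its (hypothetical) shape embedding.
    """
    return sorted(
        gallery,
        key=lambda model_id: cosine_similarity(query_emb, gallery[model_id]),
        reverse=True,
    )

def hit_in_top_fraction(ranking, true_id, fraction=0.1):
    """Check whether the correct model is within the top fraction of the ranking."""
    k = max(1, math.ceil(fraction * len(ranking)))
    return true_id in ranking[:k]

# Toy gallery of three 3D models and one query image embedding.
gallery = {"chair": [1.0, 0.0], "table": [0.0, 1.0], "lamp": [1.0, 1.0]}
query = [0.9, 0.1]
ranking = rank_gallery(query, gallery)
```

With a gallery this small, the top 10% rounds up to a single candidate, so `hit_in_top_fraction(ranking, "chair")` only succeeds if the correct model is ranked first.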

Paper Structure

This paper contains 6 sections, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Our network retrieves an unknown 3D model from an RGB image on the sole basis of object geometry.
  • Figure 2: Network Architecture
  • Figure 3: Shape Descriptor
  • Figure 4: Photorealistic renderings (BlenderProc, left) draw a direct link with the real world, while non-realistic ones (NDDS, right) provide more background and texture variation.
  • Figure 5: Deriving a sufficient number of objects to learn generic shape similarity. For comparison we also include the retrieval rates of the best instance-aware model.