Zero in on Shape: A Generic 2D-3D Instance Similarity Metric learned from Synthetic Data
Maciej Janik, Niklas Gard, Anna Hilsmann, Peter Eisert
TL;DR
The paper tackles cross-modal instance retrieval by learning a generic shape similarity metric between RGB images and untextured 3D models. It introduces a siamese network that aggregates multiple views of a 3D model into a single shape embedding and aligns it with image embeddings using a cosine-distance contrastive loss, trained exclusively on synthetic data via Domain Randomization. The authors show that increasing synthetic-data diversity and sharing siamese weights improve zero-shot retrieval, with Top-5 performance nearly matching the instance-aware baseline. The work demonstrates practical zero-shot 2D-3D retrieval and provides guidance on synthetic-data design to bridge the domain gap. The approach could support scalable retrieval in settings where new shapes appear frequently.
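The view aggregation and cosine-distance contrastive loss mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the element-wise max pooling over views, the squared-distance form of the loss, the margin value, and all function names (`aggregate_views`, `cosine_distance`, `contrastive_loss`) are assumptions chosen for illustration, and embeddings are plain Python lists rather than network outputs.

```python
import math


def aggregate_views(view_embeddings):
    """Fuse per-view embeddings into one shape embedding.

    Assumption: element-wise max pooling across views; the paper may
    use a different aggregation.
    """
    return [max(dims) for dims in zip(*view_embeddings)]


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def contrastive_loss(image_emb, shape_emb, is_match, margin=0.5):
    """Contrastive loss on cosine distance (margin is a hypothetical value).

    Matching pairs are pulled together; non-matching pairs are pushed
    apart until their distance exceeds the margin.
    """
    d = cosine_distance(image_emb, shape_emb)
    if is_match:
        return d ** 2
    return max(0.0, margin - d) ** 2
```

In this sketch a matching image-shape pair contributes zero loss only when the two embeddings point in the same direction, while a non-matching pair stops contributing once its cosine distance exceeds the margin.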
Abstract
We present a network architecture which compares RGB images and untextured 3D models by the similarity of the represented shape. Our system is optimised for zero-shot retrieval, meaning it can recognise shapes never shown in training. We use a view-based shape descriptor and a siamese network to learn object geometry from pairs of 3D models and 2D images. Due to scarcity of datasets with exact photograph-mesh correspondences, we train our network with only synthetic data. Our experiments investigate the effect of different qualities and quantities of training data on retrieval accuracy and present insights from bridging the domain gap. We show that increasing the variety of synthetic data improves retrieval accuracy and that our system's performance in zero-shot mode can match that of the instance-aware mode, as far as narrowing down the search to the top 10% of objects.
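To make the "top 10% of objects" criterion concrete, the following sketch ranks candidate shape embeddings by cosine distance to an image embedding and checks whether the correct shape falls within a given fraction of the ranked list. The function names (`rank_shapes`, `hit_in_top_fraction`) and the rounding of the cut-off with `math.ceil` are assumptions for illustration, not the paper's evaluation protocol.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def rank_shapes(image_emb, shape_embs):
    """Return shape indices ordered from most to least similar."""
    distances = [cosine_distance(image_emb, s) for s in shape_embs]
    return sorted(range(len(shape_embs)), key=lambda i: distances[i])


def hit_in_top_fraction(image_emb, shape_embs, true_index, fraction=0.1):
    """Check whether the correct shape is within the top `fraction` of the ranking.

    Assumption: the cut-off is rounded up and is at least one shape.
    """
    k = max(1, math.ceil(fraction * len(shape_embs)))
    return true_index in rank_shapes(image_emb, shape_embs)[:k]
```

Averaging `hit_in_top_fraction` over a set of query images would give a retrieval accuracy at that cut-off; in the zero-shot setting the candidate shapes would be ones never shown during training.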
