Efficient and Discriminative Image Feature Extraction for Universal Image Retrieval
Morris Florek, David Tschirschwitz, Björn Barz, Volker Rodehorst
TL;DR
The paper tackles domain generalization in image retrieval by proposing a resource-efficient framework to learn a universal image encoder for instance-level retrieval. It introduces M4D-35k, a compact multi-domain, instance-labeled training set designed for efficient training, and demonstrates that training only a projection head (linear probing) on top of a pre-trained visual-semantic backbone can yield competitive universal embeddings. Through extensive evaluation of multiple foundation models and margin-based metric learning losses, the authors identify effective combinations (e.g., SigLIP-SoViT-400m with Sub-Center ArcFace) that achieve near-SOTA performance on the Google Universal Image Embedding Challenge with $\mathrm{mMP}@5 = 0.721$, while using $32\%$ fewer total parameters and $289\times$ fewer trainable parameters than end-to-end fine-tuned rivals. The work emphasizes careful data curation and efficient training strategies, delivering reproducible results and providing code and M4D-35k annotations to the community; future directions include evaluating on the UnED dataset and exploring similar resource-efficient pipelines on broader benchmarks.
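To make the idea of linear probing with a margin-based loss more concrete, the following is a minimal, dependency-free sketch and not the authors' implementation. It assumes the backbone features are already computed and frozen, and that only a linear projection (`weight`) and the class sub-centers (`centers`) would be trained. All function names, the margin, and the scale values are hypothetical choices for illustration. The margin step follows the usual ArcFace form $\cos(\theta + m)$ and omits the safeguards real implementations add for $\theta + m > \pi$.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def project(features, weight):
    """Linear projection head: one output value per row of `weight`.

    `features` stands in for the output of a frozen backbone (assumption).
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weight]


def subcenter_arcface_logits(embedding, centers, target, margin=0.5, scale=30.0):
    """Compute scaled logits in the style of Sub-Center ArcFace.

    centers[c][k] is the k-th sub-center of class c. For each class, the
    cosine similarity to its closest sub-center is used; the angular margin
    is added only for the target class.
    """
    e = l2_normalize(embedding)
    logits = []
    for c, subcenters in enumerate(centers):
        # Max-pool the cosine similarity over the sub-centers of class c.
        cos = max(
            sum(a * b for a, b in zip(e, l2_normalize(w))) for w in subcenters
        )
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if c == target:
            cos = math.cos(math.acos(cos) + margin)
        logits.append(scale * cos)
    return logits


def cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for a single sample."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]
```

In a training loop, the loss from `cross_entropy` would be backpropagated into the projection weights and sub-centers only, which is what keeps the number of trainable parameters small relative to end-to-end fine-tuning.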
Abstract
Current image retrieval systems often face domain specificity and generalization issues. This study aims to overcome these limitations by developing a computationally efficient training framework for a universal feature extractor that provides strong semantic image representations across various domains. To this end, we curate a multi-domain training dataset, called M4D-35k, which allows for resource-efficient training. Additionally, we conduct an extensive evaluation and comparison of various state-of-the-art visual-semantic foundation models and margin-based metric learning loss functions regarding their suitability for efficient universal feature extraction. Despite constrained computational resources, we achieve near state-of-the-art results on the Google Universal Image Embedding Challenge, with an mMP@5 of 0.721. This places our method second on the leaderboard, just 0.7 percentage points behind the best-performing method, while our model has 32% fewer overall parameters and 289 times fewer trainable parameters. Compared to methods with similar computational requirements, we outperform the previous state of the art by 3.3 percentage points. We release our code and M4D-35k training set annotations at https://github.com/morrisfl/UniFEx.
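For readers unfamiliar with the reported metric, the sketch below illustrates one way to compute a modified mean precision at 5, assuming the definition we understand the challenge to use: per query, precision is taken over the top $\min(n_q, 5)$ results, where $n_q$ is the number of relevant index images, so that queries with fewer than five relevant images are not penalized. The function names and the use of label equality as the relevance test are assumptions made for illustration.

```python
def modified_precision_at_k(retrieved_labels, query_label, num_relevant, k=5):
    """Precision over the top min(num_relevant, k) retrieved items.

    retrieved_labels: labels of the ranked index images, best match first.
    num_relevant: number of index images that share the query's label.
    """
    cutoff = min(num_relevant, k)
    if cutoff == 0:
        raise ValueError("query has no relevant index images")
    hits = sum(1 for label in retrieved_labels[:cutoff] if label == query_label)
    return hits / cutoff


def mean_modified_precision(queries, k=5):
    """Average the per-query score over all queries.

    queries: iterable of (retrieved_labels, query_label, num_relevant) tuples.
    """
    queries = list(queries)
    total = sum(
        modified_precision_at_k(labels, query_label, n_rel, k=k)
        for labels, query_label, n_rel in queries
    )
    return total / len(queries)


# Hypothetical example: the first query has many relevant images, the second only one.
example_queries = [
    (["a", "a", "b", "a", "c"], "a", 10),  # 3 hits in the top 5
    (["b", "x"], "b", 1),                  # 1 hit in the top 1
]
score = mean_modified_precision(example_queries)
```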
