Efficient and Discriminative Image Feature Extraction for Universal Image Retrieval

Morris Florek, David Tschirschwitz, Björn Barz, Volker Rodehorst

TL;DR

The paper tackles domain generalization in image retrieval by proposing a resource-efficient framework to learn a universal image encoder for instance-level retrieval. It introduces M4D-35k, a compact multi-domain, instance-labeled training set designed for efficient training, and demonstrates that training only a projection head (linear probing) on top of a pre-trained visual-semantic backbone can yield competitive universal embeddings. Through extensive evaluation of multiple foundation models and margin-based metric learning losses, the authors identify effective combinations (e.g., SigLIP-SoViT-400m with Sub-Center ArcFace) that achieve near-SOTA performance on the Google Universal Image Embedding Challenge with $mMP@5 = 0.721$, while using $32\%$ fewer total parameters and $289\times$ fewer trainable parameters than end-to-end fine-tuned rivals. The work emphasizes careful data curation and efficient training strategies, delivering reproducible results and providing code and M4D-35k annotations to the community; future directions include evaluating on the UnED dataset and exploring similar resource-efficient pipelines on broader benchmarks.

Abstract

Current image retrieval systems often face domain specificity and generalization issues. This study aims to overcome these limitations by developing a computationally efficient training framework for a universal feature extractor that provides strong semantic image representations across various domains. To this end, we curated a multi-domain training dataset, called M4D-35k, which allows for resource-efficient training. Additionally, we conduct an extensive evaluation and comparison of various state-of-the-art visual-semantic foundation models and margin-based metric learning loss functions regarding their suitability for efficient universal feature extraction. Despite constrained computational resources, we achieve near state-of-the-art results on the Google Universal Image Embedding Challenge, with an mMP@5 of 0.721. This places our method second on the leaderboard, just 0.7 percentage points behind the best-performing method. However, our model has 32% fewer overall parameters and 289 times fewer trainable parameters. Compared to methods with similar computational requirements, we outperform the previous state of the art by 3.3 percentage points. We release our code and M4D-35k training set annotations at https://github.com/morrisfl/UniFEx.

Paper Structure (21 sections, 3 equations, 5 figures, 11 tables)

Figures (5)

  • Figure 1: Results on the GUIEC [araujo_google_2022] test set. Comparing our approach to the GUIEC leaderboard by plotting the evaluation metric ($mMP@5$) over the number of total model parameters. The bubble’s area is proportional to the number of trainable model parameters.
  • Figure 1: Margin mapping function $f(n)$ with $m_{\text{max}} = 0.6$ and $m_{\text{min}} = 0.2$
  • Figure 2: The table on the left displays the datasets considered for the curated M4D-35k. Datasets are ranked according to their frequency of use in the GUIEC [araujo_google_2022], as measured by the $mAP$ relative to the GUIEC leaderboard rank. The curation process is shown on the right.
  • Figure 2: Overview of the SAM [kirillov_segment_2023] image encoder and the layers from which the embeddings were extracted.
  • Figure 3: The embedding model consists of a visual-semantic foundation model as backbone, followed by a projection head. During training, a margin-based metric learning loss is employed, with cosine similarities $\cos(\theta)$ derived via matrix multiplication from the normalized embeddings $x$ and weights $W$. An angular margin $m$ is added to the target angle $\theta_{y_i}$, logits are scaled by the scaling parameter $s$, and both softmax activation and cross-entropy loss are applied. The model’s trainable and non-trainable components are also detailed.
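
The loss computation described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration of a plain ArcFace-style loss, not the authors' implementation: the function name `arcface_loss`, the default values of `m` and `s`, and the toy inputs are assumptions made for this example. The sub-center extension of Sub-Center ArcFace is omitted, as is the handling of the case $\theta_{y_i} + m > \pi$.

```python
import numpy as np

def arcface_loss(embeddings, weights, labels, m=0.5, s=30.0):
    """Plain ArcFace-style loss (illustrative sketch; defaults are assumed values).

    embeddings: (batch, dim) projection-head outputs
    weights:    (classes, dim) class weight vectors W
    labels:     (batch,) integer class indices y_i
    """
    # L2-normalize embeddings and class weights row-wise.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)

    # Cosine similarities via matrix multiplication, clipped for arccos.
    cos_theta = np.clip(x @ w.T, -1.0, 1.0)
    theta = np.arccos(cos_theta)

    # Add the angular margin m to the target-class angle only.
    rows = np.arange(len(labels))
    theta[rows, labels] += m

    # Scale the logits by s, then apply softmax and cross-entropy.
    logits = s * np.cos(theta)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[rows, labels].mean()

# Toy usage: two samples that align exactly with their class weights.
emb = np.array([[1.0, 0.0], [0.0, 1.0]])
W = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([0, 1])
loss_without_margin = arcface_loss(emb, W, y, m=0.0)
loss_with_margin = arcface_loss(emb, W, y, m=0.5)
```

In this toy case the margin lowers the target-class logit from $s$ to $s\cos(m)$, so the loss with margin is larger than without it. That penalty is what pushes embeddings of the same class closer together in angle during training.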