Three Things to Know about Deep Metric Learning

Yash Patel, Giorgos Tolias, Jiri Matas

TL;DR

This work tackles open-set image retrieval by aligning training with the retrieval metric through a differentiable Recall@k surrogate loss and a similarity-based mixup called SiMix. It enables large-batch training through a memory-efficient two-pass forward computation and demonstrates that initialization from models pre-trained at scale (ImageNet-21k, CLIP, DINOv2, SWAG, DiHT) substantially boosts performance. The combined approach yields state-of-the-art results across standard deep metric learning benchmarks and, with strong initialization, nearly solves several datasets. The proposed methods scale to large models in retrieval tasks and provide guidance for hyper-parameter tuning and initialization in DML pipelines.

Abstract

This paper addresses supervised deep metric learning for open-set image retrieval, focusing on three key aspects: the loss function, mixup regularization, and model initialization. In deep metric learning, optimizing the retrieval evaluation metric, recall@k, via gradient descent is desirable but challenging due to its non-differentiable nature. To overcome this, we propose a differentiable surrogate loss that is computed on large batches, nearly equivalent to the entire training set. This computationally intensive process is made feasible through an implementation that bypasses the GPU memory limitations. Additionally, we introduce an efficient mixup regularization technique that operates on pairwise scalar similarities, effectively increasing the batch size even further. The training process is further enhanced by initializing the vision encoder using foundational models, which are pre-trained on large-scale datasets. Through a systematic study of these components, we demonstrate that their synergy enables large models to nearly solve popular benchmarks.
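The idea of a differentiable recall@k surrogate can be illustrated with a minimal sketch: each Heaviside step in the hard metric is replaced by a temperature-scaled sigmoid, once for estimating the rank of a positive and once for checking whether that rank falls within the top k. The function names, the temperature values (`tau_rank`, `tau_k`), and the choice to count all other database items in the rank are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def sigmoid(u, tau):
    """Temperature-scaled logistic function, a smooth stand-in for the Heaviside step."""
    z = u / tau
    # Only exponentiate non-positive values to avoid overflow.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def smooth_rank(sims, i, tau_rank):
    """Soft rank of item i: 1 plus a soft count of items more similar to the query."""
    return 1.0 + sum(
        sigmoid(s - sims[i], tau_rank) for j, s in enumerate(sims) if j != i
    )


def recall_surrogate(sims, is_positive, k, tau_rank=0.01, tau_k=1.0):
    """Smooth recall@k for one query: mean soft indicator that a positive is in the top k.

    tau_rank and tau_k are illustrative defaults, not values taken from the paper.
    """
    positives = [i for i, p in enumerate(is_positive) if p]
    if not positives:
        raise ValueError("the query needs at least one positive")
    total = sum(
        sigmoid(k - smooth_rank(sims, i, tau_rank), tau_k) for i in positives
    )
    return total / len(positives)


def hard_recall_at_k(sims, is_positive, k):
    """Non-differentiable reference: 1.0 if any positive is among the k most similar items."""
    order = sorted(range(len(sims)), key=lambda i: -sims[i])
    return 1.0 if any(is_positive[i] for i in order[:k]) else 0.0


# Query-to-database similarities and positive/negative labels (toy values).
sims_before = [0.9, 0.8, 0.3, 0.1]
sims_after = [0.9, 0.95, 0.3, 0.1]  # the first positive now outranks the top negative
labels = [False, True, False, True]
```

A training loss could then be taken as one minus the surrogate, averaged over queries in the batch. Unlike the hard metric, the surrogate changes smoothly as a positive moves past a negative, which is what makes it usable with gradient descent.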

Paper Structure

This paper contains 36 sections, 10 equations, 9 figures, 5 tables, 2 algorithms.

Figures (9)

  • Figure 1: Example of a training batch with a query image, two positive and two negative images. Despite not positively affecting the image ranking, large similarity changes may correspond to large loss with pair-based or triplet-based loss functions, such as contrastive or triplet loss. The proposed recall loss reflects the test-time evaluation metric and focuses on similarity changes that have a positive impact on the ranks.
  • Figure 2: A comparison between recall@k and rs@k, the proposed differentiable recall@k surrogate. Examples show a query, the ranked database images sorted according to the similarity and the corresponding values for recall@k and rs@k and their dependence on similarity score change. Note that the values of recall@k and rs@k are close. Changes to similarity and ranking in some cases may not affect the original recall@k but can affect the surrogate, with the latter having a more significant impact than the former. Similarity values of all negatives are fixed for ease of understanding. The similarity values of the positives that were changed in rows 2, 3 and 4 are underlined.
  • Figure 3: The two sigmoid functions which replace the Heaviside step function for counting the positive examples in the short-list of size $k$ (left) and for estimating the rank of examples (right).
  • Figure 4: Gradient magnitude of the sigmoid used to count the positive examples in the short-list of size $k$ versus the rank $r$ (equal to $r_\Omega(q,x)$, see (\ref{equ:rank})) of a positive example $x$. It shows how much a positive example is pushed towards lower ranks depending on its current rank. In the case of multiple values for $k$, the total gradient is equivalent to the sum of the separate ones.
  • Figure 5: SiMix on a toy example with 3 classes (orange, blue, purple). The batch comprises original and virtual examples. Three kinds of pairwise similarities are computed using direct dot product similarity, (\ref{equ:simix1}), or (\ref{equ:simix2}). RS@k is applied independently per row of the similarity matrix according to the positive/negative labels. Mixed embeddings are illustrated but are never explicitly created; all similarities are estimated directly from the original embeddings.
  • ...and 4 more figures
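The Figure 5 caption notes that mixed embeddings are never explicitly created. One way to see why this is possible is the bilinearity of the dot product: if a virtual example is an unnormalized linear interpolation of two embeddings, its similarity to any other example is a weighted sum of similarities that are already available. The sketch below checks this numerically. The interpolation form, the coefficients `alpha` and `beta`, and all function names are assumptions made for illustration; the paper's own equations are not reproduced here and may differ, for example by including normalization.

```python
import math


def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def mix(a, b, alpha):
    """Explicit linear interpolation of two embeddings (used only for verification)."""
    return [alpha * x + (1.0 - alpha) * y for x, y in zip(a, b)]


def sim_mixed_vs_original(s_ax, s_bx, alpha):
    """Similarity between mix(a, b, alpha) and x, from the scalars a.x and b.x only."""
    return alpha * s_ax + (1.0 - alpha) * s_bx


def sim_mixed_vs_mixed(s_ac, s_ad, s_bc, s_bd, alpha, beta):
    """Similarity between mix(a, b, alpha) and mix(c, d, beta), from four scalars only."""
    return (
        alpha * beta * s_ac
        + alpha * (1.0 - beta) * s_ad
        + (1.0 - alpha) * beta * s_bc
        + (1.0 - alpha) * (1.0 - beta) * s_bd
    )


# Toy embeddings (hypothetical values).
a, b = [1.0, 0.0, 0.5], [0.2, 0.9, -0.1]
c, d = [0.3, 0.3, 0.8], [-0.4, 0.7, 0.2]
alpha, beta = 0.3, 0.6

# Similarities computed without ever building the mixed vectors.
fast_orig = sim_mixed_vs_original(dot(a, c), dot(b, c), alpha)
fast_mixed = sim_mixed_vs_mixed(dot(a, c), dot(a, d), dot(b, c), dot(b, d), alpha, beta)

# Reference values computed by explicitly mixing the embeddings.
slow_orig = dot(mix(a, b, alpha), c)
slow_mixed = dot(mix(a, b, alpha), mix(c, d, beta))
```

Under this assumption, adding virtual examples costs only scalar arithmetic on the existing pairwise similarity matrix, which is consistent with the abstract's description of mixup operating on pairwise scalar similarities.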