A Metric Learning Reality Check

Kevin Musgrave, Serge Belongie, Ser-Nam Lim

TL;DR

The paper questions whether advances in deep metric learning have reflected real progress or been inflated by flawed evaluation practices. It identifies major issues—unfair comparisons, weak and potentially misleading metrics, and training with test-set feedback—and proposes a fair, reproducible protocol with MAP@R and cross-validated hyperparameter tuning. By re-evaluating a broad set of losses under this protocol, it shows that improvements over classic baselines are modest and often indistinguishable when properly controlled. The work argues for more rigorous methodology and suggests that reported gains may reflect hyperparameter choices rather than fundamental advances, with implications for open-set retrieval and broader ML research including self-supervised approaches.
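
As a rough illustration of the MAP@R metric named above, here is a minimal sketch. It assumes the usual definition of Mean Average Precision at R: for each query, R is the number of references that share the query's class, only the R nearest retrievals are considered, and precision is accumulated only at ranks where the retrieved item is correct. The function name `map_at_r` and the toy labels are hypothetical and are not taken from the authors' code.

```python
def map_at_r(query_label, ranked_labels, r):
    """Per-query MAP@R (illustrative sketch, not the authors' implementation).

    query_label:   class label of the query.
    ranked_labels: labels of the retrieved references, nearest first.
    r:             number of references sharing the query's class.
    """
    if r == 0:
        return 0.0
    hits = 0
    total = 0.0
    # Only the R nearest neighbours are scored.
    for rank, label in enumerate(ranked_labels[:r], start=1):
        if label == query_label:
            hits += 1
            # Precision at this rank counts only when the retrieval is correct.
            total += hits / rank
    return total / r


# Hypothetical toy query: class "a" has 3 references in the gallery.
score = map_at_r("a", ["a", "b", "a", "a"], r=3)
print(round(score, 4))
```

Under this definition, a correct match at a high rank contributes more than one further down, so the score reflects the ordering of the neighbours and not only whether the nearest one is correct. The dataset-level score would be the mean of this value over all queries.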

Abstract

Deep metric learning papers from the past four years have consistently claimed great advances in accuracy, often more than doubling the performance of decade-old methods. In this paper, we take a closer look at the field to see if this is actually true. We find flaws in the experimental methodology of numerous metric learning papers, and show that the actual improvements over time have been marginal at best.

Paper Structure

This paper contains 21 sections, 2 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: How different accuracy metrics score on three toy examples.
  • Figure 2: Papers versus Reality: the trend of Precision@1 of various methods over the years. In (a), the baseline methods have * next to them, which indicates that their numbers are the average reported accuracy from all papers that included those baselines.
  • Figure 3: Papers versus Reality: we look at the results tables of all methods presented in Figure 2. 11 of these include the contrastive loss, and 12 include the triplet loss (without semihard mining). For each paper, we compute the relative percentage improvement of their proposed method over their reported result for the contrastive or triplet loss, and then take the average improvement across papers (grey bars in the above figures). The green and red bars are the average relative improvement that we obtain, in the separated 128-dim and concatenated 512-dim settings, respectively. For the "reality" numbers in (a) we excluded the FastAP loss from the calculation, since it was a poorly performing outlier in our experiments, and we excluded the triplet loss since we consider it a baseline method. Likewise for the "reality" numbers in (b), we excluded the FastAP and contrastive losses from the calculation.
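
The averaging described in the Figure 3 caption can be sketched as follows. This is a minimal illustration that assumes "relative percentage improvement" means 100 × (proposed − baseline) / baseline. The function names and accuracy values are hypothetical placeholders, not numbers from the paper.

```python
def relative_improvement(proposed, baseline):
    """Relative percentage improvement of a proposed method over a baseline."""
    return 100.0 * (proposed - baseline) / baseline


def mean_relative_improvement(pairs):
    """Average the per-paper relative improvements.

    pairs: list of (proposed_accuracy, baseline_accuracy) tuples,
           one tuple per paper that reports the baseline.
    """
    gains = [relative_improvement(p, b) for p, b in pairs]
    return sum(gains) / len(gains)


# Hypothetical accuracies, for illustration only.
reported = [(60.0, 50.0), (55.0, 50.0)]
print(mean_relative_improvement(reported))
```

Note that the improvement is computed per paper first and averaged afterwards, so each paper contributes equally whatever its absolute accuracy.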