A Metric Learning Reality Check
Kevin Musgrave, Serge Belongie, Ser-Nam Lim
TL;DR
The paper asks whether reported advances in deep metric learning reflect real progress or have been inflated by flawed evaluation practices. It identifies three major issues: unfair comparisons, weak and potentially misleading accuracy metrics, and training with test-set feedback. In response, it proposes a fair, reproducible protocol that uses MAP@R as the accuracy metric and tunes hyperparameters with cross-validation. Re-evaluating a broad set of losses under this protocol shows that, once comparisons are properly controlled, newer methods improve on classic baselines only modestly and are often indistinguishable from them. The work argues for more rigorous methodology and suggests that reported gains may reflect hyperparameter choices rather than fundamental advances, with implications for open-set retrieval and for broader ML research, including self-supervised approaches.
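To make the MAP@R metric concrete, here is a minimal sketch of how it can be computed for a single query. It assumes the usual definition: R is the number of reference samples that share the query's class, the R nearest references are retrieved, and the score is the mean over ranks 1..R of the precision at each rank, counted only where the retrieved item is correct. The function name `map_at_r` and its arguments are illustrative and are not taken from the paper's code.

```python
def map_at_r(query_label, ranked_labels, r):
    """Mean Average Precision at R for one query (illustrative sketch).

    query_label:   class label of the query.
    ranked_labels: labels of the retrieved references, nearest first.
    r:             number of references that share the query's class.
    """
    if r == 0:
        return 0.0  # assumption: a query with no same-class references scores 0

    hits = 0      # correct retrievals seen so far
    total = 0.0   # sum of precision@i over the ranks i that are correct
    for i, label in enumerate(ranked_labels[:r], start=1):
        if label == query_label:
            hits += 1
            total += hits / i  # precision at rank i
    return total / r


# Three same-class references exist; ranks 1 and 3 are correct.
score = map_at_r("a", ["a", "b", "a"], r=3)
print(round(score, 4))  # (1/1 + 2/3) / 3
```

Unlike Recall@1, this score depends on the ordering of all R retrieved items, so a well-clustered embedding scores higher than one that only places a single correct neighbor first. A dataset-level score would average this value over all queries.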
Abstract
Deep metric learning papers from the past four years have consistently claimed great advances in accuracy, often more than doubling the performance of decade-old methods. In this paper, we take a closer look at the field to see if this is actually true. We find flaws in the experimental methodology of numerous metric learning papers, and show that the actual improvements over time have been marginal at best.
