All You Need to Know About Training Image Retrieval Models

Gabriele Berton, Kevin Musgrave, Carlo Masone

TL;DR

This work systematically analyzes how training-time factors such as loss functions, optimizers, sampling, learning rates, batch sizes, and labeling strategies affect image retrieval performance across multiple datasets, addressing a gap in understanding how these components interact. From tens of thousands of training runs, it derives practical guidelines: use CLS features from DINO-v2 with full fine-tuning and a model LR of $10^{-6}$, use large batches for contrastive losses with online miners, and tune the classifier LR separately (around $1$); in constrained settings, CosFace/ArcFace with 1 image per class can be preferable. The study finds that metric-learning losses tolerate some label noise, while more classes and larger datasets generally boost accuracy, suggesting that data quantity matters more than meticulous labeling; feature dimensionality and sampling choices also affect performance and memory use. Together, these insights offer actionable strategies for building efficient, scalable image retrieval systems under diverse compute budgets, balancing accuracy, resources, and data collection effort.
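The two learning rates mentioned above can be kept apart with optimizer parameter groups. The following is a minimal sketch, not the authors' implementation: `build_param_groups` and the placeholder parameter names are hypothetical, and it assumes that the "classifier" refers to the learnable class weights used by classification-style losses such as CosFace or ArcFace. The list-of-dicts layout mirrors the per-parameter-group format that PyTorch optimizers accept, but no deep learning library is imported here.

```python
# Hypothetical helper; names and structure are illustrative only.
MODEL_LR = 1e-6       # backbone learning rate suggested in the TL;DR
CLASSIFIER_LR = 1.0   # classifier learning rate ("around 1"); tune separately


def build_param_groups(backbone_params, classifier_params,
                       model_lr=MODEL_LR, classifier_lr=CLASSIFIER_LR):
    """Return one parameter group per learning rate."""
    return [
        {"params": list(backbone_params), "lr": model_lr},
        {"params": list(classifier_params), "lr": classifier_lr},
    ]


# Placeholder strings stand in for real parameter tensors.
groups = build_param_groups(["backbone.weight"], ["classifier.weight"])
for group in groups:
    print(group["lr"], group["params"])
```

With a real model, the two lists would be the backbone parameters and the loss function's learnable parameters, and `groups` would be passed to the optimizer constructor.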

Abstract

Image retrieval is the task of finding images in a database that are most similar to a given query image. The performance of an image retrieval pipeline depends on many training-time factors, including the embedding model architecture, loss function, data sampler, mining function, learning rate(s), and batch size. In this work, we run tens of thousands of training runs to understand the effect each of these factors has on retrieval accuracy. We also discover best practices that hold across multiple datasets. The code is available at https://github.com/gmberton/image-retrieval
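To make the task definition concrete, here is a small sketch of retrieval over precomputed embeddings. It rests on assumptions not stated in the abstract: similarity is measured with cosine similarity, accuracy is measured as Recall@1 (the fraction of queries whose top match shares their label), and the toy 2-D embeddings and labels are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query, database, k=1):
    """Return indices of the k database embeddings most similar to the query."""
    scores = [cosine_similarity(query, emb) for emb in database]
    ranked = sorted(range(len(database)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


def recall_at_1(queries, query_labels, database, database_labels):
    """Fraction of queries whose nearest database item has the same label."""
    hits = 0
    for query, label in zip(queries, query_labels):
        top = retrieve(query, database, k=1)[0]
        hits += database_labels[top] == label
    return hits / len(queries)


# Toy embeddings and labels, invented for this example.
database = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
database_labels = ["car", "bird", "car"]
queries = [[0.9, 0.1], [0.1, 0.9]]
query_labels = ["car", "bird"]

print(retrieve(queries[0], database, k=2))
print(recall_at_1(queries, query_labels, database, database_labels))
```

In a real pipeline the embeddings would come from the trained model, and the exhaustive scan would typically be replaced by a batched matrix product or an approximate nearest-neighbor index.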

Paper Structure

This paper contains 18 sections, 10 figures, 2 tables.

Figures (10)

  • Figure 1: Examples of images from the four datasets. Each row is a dataset, and each group of four images is a class (two classes per dataset). From top to bottom, the rows are Cars196, CUB, iNaturalist2018, StanfordOnlineProducts.
  • Figure 2: The number of samples per class within the training set of each dataset. The y-axis for iNaturalist2018 is in log scale.
  • Figure 3: Legend of colors for each loss throughout each section of this paper. A much smaller (almost unreadable) version of this legend is also shown in every plot.
  • Figure 4: The accuracy of each loss function versus the batch size. The red horizontal line represents the accuracy of the off-the-shelf DINO-v2; when not visible, the line is too low to be included in the plot.
  • Figure 5: The accuracy of each loss function versus the percentage of data that is incorrectly labeled. The x-axis is the percentage of labels that are randomly changed during training.
  • ...and 5 more figures