Classification is a Strong Baseline for Deep Metric Learning

Andrew Zhai, Hao-Yu Wu

TL;DR

The paper investigates whether classification-based training can serve as a strong baseline for deep metric learning in image retrieval. It introduces Normalized Softmax Loss combined with Layer Normalization and class-balanced sampling, showing competitive performance across standard retrieval datasets and backbones. It also demonstrates scalability via class subsampling and the viability of high-dimensional binary embeddings that match or exceed float-embedding performance at the same memory cost. Overall, the work advocates classification-based approaches as practical and effective for large-scale metric learning tasks.
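To make the loss named above concrete, here is a minimal sketch of a normalized softmax loss for a single example. It assumes the common formulation in which both the embedding and the per-class weight vectors are L2-normalized, so the logits are cosine similarities, which are then divided by a temperature. The function names and the temperature value of 0.05 are illustrative assumptions, not details taken from this summary.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def normalized_softmax_loss(embedding, class_weights, label, temperature=0.05):
    """Cross-entropy over cosine similarities divided by a temperature.

    Assumption: one weight vector per class, normalized like the embedding.
    """
    e = l2_normalize(embedding)
    logits = []
    for w in class_weights:
        w = l2_normalize(w)
        cosine = sum(a * b for a, b in zip(e, w))
        logits.append(cosine / temperature)
    # Log-sum-exp with max subtraction for numerical stability.
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]

# Toy setup: three classes in a 2-D embedding space.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
x = [3.0, 0.5]  # points mostly toward class 0
loss_correct = normalized_softmax_loss(x, weights, label=0)
loss_wrong = normalized_softmax_loss(x, weights, label=2)
```

Because of the normalization, only the direction of the embedding matters: rescaling `x` leaves the loss unchanged, and the loss is small when `x` points toward the weight vector of its own class.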

Abstract

Deep metric learning aims to learn a function mapping image pixels to embedding feature vectors that model the similarity between images. Two major applications of metric learning are content-based image retrieval and face verification. For the retrieval tasks, the majority of current state-of-the-art (SOTA) approaches rely on triplet-based non-parametric training. For the face verification tasks, however, recent SOTA approaches have adopted classification-based parametric training. In this paper, we look into the effectiveness of classification-based approaches on image retrieval datasets. We evaluate on several standard retrieval datasets such as CARS-196, CUB-200-2011, Stanford Online Products, and In-Shop for image retrieval and clustering, and establish that our classification-based approach is competitive across different feature dimensions and base feature networks. We further provide insights into the performance effects of subsampling classes for scalable classification-based training, and the effects of binarization, enabling efficient storage and computation for practical applications.
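The storage and computation benefit of binarization can be illustrated with a small sketch. It assumes the simplest scheme, thresholding each embedding value at zero, and compares codes by Hamming distance. The helper names and the sizes in the memory arithmetic (a 64-dimensional float32 vector versus a 2048-bit code) are illustrative assumptions rather than figures from the text above.

```python
def binarize(embedding):
    """Map each float to one bit: 1 if positive, else 0 (assumed zero threshold)."""
    return [1 if x > 0 else 0 for x in embedding]

def pack_bits(bits):
    """Pack a list of 0/1 values into a single Python int."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def hamming_distance(a, b):
    """Number of differing bits between two packed codes."""
    return bin(a ^ b).count("1")

query = pack_bits(binarize([0.7, -0.2, 0.1, -0.9]))  # bits 1010
near = pack_bits(binarize([0.5, -0.4, 0.3, 0.2]))    # bits 1011
far = pack_bits(binarize([-0.6, 0.8, -0.1, 0.4]))    # bits 0101

# Memory arithmetic: a 64-dim float32 vector and a 2048-bit code
# occupy the same number of bytes.
float_bytes = 64 * 4
binary_bytes = 2048 // 8
```

At equal memory, the binary code has 32 times as many dimensions as the float32 vector, and comparing two codes reduces to an XOR followed by a bit count.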

Paper Structure

This paper contains 16 sections, 1 equation, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Architecture overview for training high-dimensional binary embeddings.
  • Figure 2: Recall@1 for CARS-196 (left) and CUB-200-2011 (right) across varying embedding dimensions. Softmax-based embeddings improve in performance as dimensionality increases. The performance gap between float and binary embeddings narrows as dimensionality increases, showing that, when given enough representational freedom, Softmax learns bit-like features.
  • Figure 3: Loss and R@1 trends for training CUB-200 ResNet50 with and without Layer Normalization. Layer Normalization helps initialize learning, leading to better training convergence and R@1 performance.
  • Figure 4: Recall@K for SOP with ResNet50 across class sampling ratios. When sampling only 10% of classes per iteration ($\sim$1K classes), training converges to an R@1 within 1% absolute of the R@1 obtained using all classes.
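The class sampling described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions: classes present in the mini-batch are always kept, the remaining budget is filled uniformly at random, and labels are remapped to positions in the reduced softmax. The function name `sample_classes` and the toy size of 1000 classes are hypothetical.

```python
import random

def sample_classes(batch_labels, num_classes, ratio, rng):
    """Pick the classes to score in this iteration.

    Assumption: classes present in the batch are always kept, and the
    rest of the budget is filled uniformly at random from the others.
    """
    present = sorted(set(batch_labels))
    budget = max(len(present), round(num_classes * ratio))
    present_set = set(present)
    others = [c for c in range(num_classes) if c not in present_set]
    sampled = present + rng.sample(others, budget - len(present))
    # Remap original class ids to positions in the reduced softmax.
    remap = {c: i for i, c in enumerate(sampled)}
    return sampled, [remap[y] for y in batch_labels]

rng = random.Random(0)
batch_labels = [3, 3, 42, 977]
sampled, local_labels = sample_classes(
    batch_labels, num_classes=1000, ratio=0.1, rng=rng
)
```

With a ratio of 0.1, the softmax for this iteration is computed over 100 of the 1000 classes, so the cost of the final layer scales with the sampled subset rather than with the full class count.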