PD-Loss: Proxy-Decidability for Efficient Metric Learning

Pedro Silva, Guilherme A. L. Silva, Pablo Coelho, Vander Freitas, Gladston Moreira, David Menotti, Eduardo Luz

TL;DR

The paper addresses the batch-size sensitivity of the Decidability Loss (D-Loss) in distribution-aware deep metric learning by introducing the Proxy-Decidability Loss (PD-Loss). PD-Loss uses learnable class proxies to estimate the genuine and impostor similarity distributions and optimizes a log-transformed, temperature-scaled version of the Decidability Index $d'$, without relying on exhaustive pairwise mining. Empirical results on CUB-200-2011, CARS196, and LFW show PD-Loss achieving competitive or state-of-the-art performance while remaining more robust to batch size and more efficient to train. The approach combines the scalability of proxy-based losses with the principled distribution-separability objective of D-Loss, suggesting applicability to embedding optimization tasks beyond standard metric learning benchmarks.
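For reference, the Decidability Index is commonly defined as the distance between the means of the genuine and impostor score distributions, normalized by their pooled standard deviation. The block below writes that standard definition using the symbols $\mu_{gen}, \mu_{imp}, \sigma^2_{gen}, \sigma^2_{imp}$ from the Figure 1 caption. The exact log-transformed, temperature-scaled form that PD-Loss optimizes is not given in this summary, so it is not reproduced here.

```latex
\[
d' = \frac{\left| \mu_{gen} - \mu_{imp} \right|}
          {\sqrt{\tfrac{1}{2}\left( \sigma^2_{gen} + \sigma^2_{imp} \right)}}
\]
```

A larger $d'$ means the two distributions overlap less, which can be achieved either by pushing the means apart or by shrinking the variances.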

Abstract

Deep Metric Learning (DML) aims to learn embedding functions that map semantically similar inputs to proximate points in a metric space while separating dissimilar ones. Existing methods, such as pairwise losses, are hindered by complex sampling requirements and slow convergence. In contrast, proxy-based losses, despite their improved scalability, often fail to optimize global distribution properties. The Decidability-based Loss (D-Loss) addresses this by targeting the decidability index (d') to enhance distribution separability, but its reliance on large mini-batches imposes significant computational constraints. We introduce Proxy-Decidability Loss (PD-Loss), a novel objective that integrates learnable proxies with the statistical framework of d' to optimize embedding spaces efficiently. By estimating genuine and impostor distributions through proxies, PD-Loss combines the computational efficiency of proxy-based methods with the principled separability of D-Loss, offering a scalable approach to distribution-aware DML. Experiments across various tasks, including fine-grained classification and face verification, demonstrate that PD-Loss achieves performance comparable to that of state-of-the-art methods while introducing a new perspective on embedding optimization, with potential for broader applications.

Paper Structure

This paper contains 17 sections, 6 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: Conceptual illustration of PD-Loss. Left: Samples from different classes ($\textcolor{cyan}{\star}, \textcolor{magenta}{+}, \textcolor{blue}{\diamond}$) and their corresponding learnable proxies ($\textcolor{black}{\star}, \textcolor{black}{+}, \textcolor{black}{\diamond}$) co-exist in the embedding space. Center: Scaled similarities between samples and proxies are computed. These similarities form the input for the PD-Loss function. Right: The PD-Loss objective aims to maximize the Decidability Index ($d'$) derived from the estimated genuine (sample-to-correct-proxy) and impostor (sample-to-incorrect-proxy) similarity distributions. This involves maximizing the separation between their means ($\mu_{gen}, \mu_{imp}$) and minimizing their variances ($\sigma^2_{gen}, \sigma^2_{imp}$). Both the backbone network and the proxies are updated via gradient descent.
  • Figure 2: Comparison of D-Loss and PD-Loss with and without the logarithmic term. R@1 is computed per epoch on the validation split of the CUB200 dataset, using a batch size of 500, the ResNet-18 architecture, and an embedding size of 256.
  • Figure 3: Comparison of different losses in terms of R@1 over training epochs.
  • Figure 4: Impact of batch size on PD-Loss (Random Init, $\tau=1.0$) on the CUB-200 validation set (embedding size 512, 500 epochs), with R@1 curves for BS$=\{32, 64\}$.
  • Figure 5: Comparison of PD-Loss (Random Init, $\tau=1.0$) against baselines. Recall@K performance on the (a) CUB-200, (b) CARS196, and (c) LFW test sets. (d) Average training time per epoch.
  • ...and 1 more figure
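
The quantity described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. It assumes cosine similarity between L2-normalized samples and proxies, and it computes only the plain Decidability Index from the sample-to-proxy similarities. The function name `proxy_decidability` and the toy data are hypothetical. The logarithmic transform and temperature scaling mentioned in the TL;DR are deliberately left out because their exact form is not given here, and in the actual method the proxies would be learnable parameters updated by gradient descent rather than fixed arrays.

```python
import numpy as np


def proxy_decidability(embeddings, labels, proxies, eps=1e-12):
    """Decidability index d' estimated from sample-to-proxy similarities.

    Assumption (not from the source): similarities are cosine similarities
    between L2-normalized embeddings and L2-normalized class proxies.
    """
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    prx = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    sims = emb @ prx.T  # shape: (n_samples, n_classes)

    # Genuine scores: each sample against the proxy of its own class.
    genuine_mask = np.zeros(sims.shape, dtype=bool)
    genuine_mask[np.arange(len(labels)), labels] = True
    genuine = sims[genuine_mask]
    # Impostor scores: each sample against every other class proxy.
    impostor = sims[~genuine_mask]

    pooled_var = 0.5 * (genuine.var() + impostor.var())
    return np.abs(genuine.mean() - impostor.mean()) / np.sqrt(pooled_var + eps)


# Toy data: two classes in 2-D, one proxy per class (hypothetical values).
embeddings = np.array([[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]])
proxies = np.eye(2)

d_aligned = proxy_decidability(embeddings, np.array([0, 0, 1, 1]), proxies)
d_mixed = proxy_decidability(embeddings, np.array([0, 1, 0, 1]), proxies)
print(f"d' with consistent labels: {d_aligned:.2f}")
print(f"d' with mixed labels:      {d_mixed:.2f}")
```

With labels that match the cluster structure, the genuine and impostor similarities separate cleanly and $d'$ is large. With mixed labels the two distributions coincide and $d'$ collapses toward zero. Because the estimate uses one similarity per sample-proxy pair instead of all sample-sample pairs, its cost grows with the number of classes, not with the square of the batch size.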