Piecewise-Linear Manifolds for Deep Metric Learning

Shubhang Bhatnagar, Narendra Ahuja

TL;DR

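This work models the data manifold with a piecewise-linear approximation and uses the resulting neighborhoods to estimate similarity between unlabeled data points. It empirically shows that this similarity estimate correlates better with the ground truth than the estimates of current state-of-the-art techniques, and that the method outperforms existing unsupervised metric learning approaches on standard zero-shot image retrieval benchmarks.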

Abstract

Unsupervised deep metric learning (UDML) focuses on learning a semantic representation space using only unlabeled data. This challenging problem requires accurately estimating the similarity between data points, which is used to supervise a deep network. For this purpose, we propose to model the high-dimensional data manifold using a piecewise-linear approximation, with each low-dimensional linear piece approximating the data manifold in a small neighborhood of a point. These neighborhoods are used to estimate similarity between data points. We empirically show that this similarity estimate correlates better with the ground truth than the similarity estimates of current state-of-the-art techniques. We also show that proxies, commonly used in supervised metric learning, can be used to model the piecewise-linear manifold in an unsupervised setting, helping improve performance. Our method outperforms existing unsupervised metric learning approaches on standard zero-shot image retrieval benchmarks.

Paper Structure (27 sections, 9 equations, 3 figures, 3 tables)

Figures (3)

  • Figure 1: We visualize a 2D data manifold along which data embeddings lie. It may consist of multiple submanifolds, three of which are shown here in red, yellow and blue. Embeddings obtained using pre-trained features (left) are not optimal, with semantically dissimilar points (different colors) lying closer together than similar points. Our method improves the feature space in 3 steps carried out iteratively: (1) Identify and approximate the submanifold at each point (e.g., ellipses A-E) by a linear model over a neighborhood small enough that (i) it does not contain multiple submanifolds and (ii) the single submanifold it contains is adequately linear. Such low-dimensional (1D) subspaces are assumed to contain semantically similar points (empirically backed by Sec. 4.3). (2) Estimate similarity for each pair of points ($\textbf{x}_{1},\textbf{x}_{2}$) in terms of the lengths of the projections of the vector $\textbf{x}_{1}-\textbf{x}_{2}$ on the linear neighborhood D, $p_{1,2}$, and on the normal to D, $o_{1,2}$. Similarity decays faster orthogonal to a neighborhood than along it. (3) Train the network embedding to bring similar points closer together and push dissimilar points apart. This brings closer together points within the same low-dimensional neighborhood, or those in different neighborhoods but the same low-dimensional (1D) space (e.g., A, B, C). Points not lying in the same low-dimensional space (e.g., B, D, E) are pushed away from each other.
  • Figure 2: An overview of our method. Points are selected from the dataset using the neighborhood sampling strategy (Sec. \ref{sec:nn_sampling}), followed by the calculation of their embeddings using the network $f_{\theta}$ and the momentum encoder $f_{\phi}$ (Sec. \ref{sec:momentum_encoder}). Embeddings generated by the momentum encoder are used to construct a piecewise linear approximation (Sec. \ref{sec:pl_const_algo}) of the data manifold. These embeddings are used to calculate point-point (Sec. \ref{sec:point_sim}) and proxy-point (Sec. \ref{sec:proxy_sim}) similarities. The similarities are used to modulate the distance between point-point (Sec. \ref{sec:point_loss}) and proxy-point (Sec. \ref{sec:proxy_loss}) pairs by updating the network $f_{\theta}$. Locations and neighborhoods of proxies (as described in Sec. \ref{sec:proxy_manifold}) are also updated using the proxy-point (Sec. \ref{sec:proxy_loss}) and proxy-neighborhood loss (Sec. \ref{sec:proxy_neighborhood_loss}) components through backpropagation. Losses colored yellow/green are calculated only using quantities with the same color.
  • Figure 3: Variation in Recall@1 with $N_{\boldsymbol\rho}$, $m$, $N_{\alpha}$, $N_{\beta}$ and $\delta$ on the CUB-200-2011 dataset when using a GoogLeNet backbone with 128-dimensional embeddings. Error bars represent standard deviations over 5 runs.
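
Steps (1) and (2) of the Figure 1 caption can be illustrated with a minimal 2D sketch: fit a 1D linear piece to a small neighborhood, then split the difference vector between two points into its component along that piece ($p_{1,2}$) and its component along the normal ($o_{1,2}$). The function names, the Gaussian-style decay, and the `along_scale` and `ortho_scale` parameters are assumptions made for illustration only; the paper's actual similarity function and hyperparameters are not given in this summary. The only property the sketch is meant to reproduce is that similarity decays faster orthogonal to the neighborhood than along it.

```python
import math


def fit_line(points):
    """Fit a 1D linear piece to 2D points: return the unit principal direction.

    Uses the closed-form principal-axis angle of the 2x2 scatter matrix,
    which is PCA restricted to two dimensions.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mean_x) ** 2 for p in points)
    syy = sum((p[1] - mean_y) ** 2 for p in points)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return (math.cos(theta), math.sin(theta))


def projections(x1, x2, direction):
    """Return (p, o): lengths of x1 - x2 along the line and along its normal."""
    dx, dy = x1[0] - x2[0], x1[1] - x2[1]
    p = abs(dx * direction[0] + dy * direction[1])
    o = abs(-dx * direction[1] + dy * direction[0])
    return p, o


def similarity(x1, x2, direction, along_scale=4.0, ortho_scale=1.0):
    """Hypothetical similarity: assumed decay form, not the paper's formula.

    Choosing along_scale > ortho_scale makes similarity fall off faster
    orthogonal to the linear piece than along it.
    """
    p, o = projections(x1, x2, direction)
    return math.exp(-((p / along_scale) ** 2) - (o / ortho_scale) ** 2)


# Usage: a neighborhood lying along the x-axis.
neighborhood = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
direction = fit_line(neighborhood)
sim_along = similarity((0.0, 0.0), (2.0, 0.0), direction)  # same linear piece
sim_ortho = similarity((0.0, 0.0), (0.0, 2.0), direction)  # off the piece
print(sim_along > sim_ortho)
```

Two points separated by the same Euclidean distance receive different similarities depending on whether the separation runs along the fitted piece or across it, which is the asymmetry the caption describes.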