On Probabilistic Embeddings in Optimal Dimension Reduction

Ryan Murray, Adam Pickarski

TL;DR

This work treats dimension reduction as a second-order variational problem, linking nonlinear embeddings to optimal transport through a relaxed embedding-plan formulation. It proves that relaxed minimizers exist for a broad class of costs and that the Marginal Problem constrains optimal plans to a minimal graph, enabling a finite-dimensional viewpoint. For normed and inner-product costs, the authors show that population-level minimizers are deterministic maps (Monge maps), even though naive particle methods can yield probabilistic embeddings due to nonconvexity and lack of lower semicontinuity. The results have practical implications for embedding reliability and algorithm design, and they open questions about how these phenomena manifest on real data and how to develop methods that avoid misleading probabilistic clustering.

Abstract

Dimension reduction algorithms are a crucial part of many data science pipelines, including data exploration, feature creation and selection, and denoising. Despite their wide utilization, many non-linear dimension reduction algorithms are poorly understood from a theoretical perspective. In this work we consider a generalized version of multidimensional scaling, which is posed as an optimization problem in which a mapping from a high-dimensional feature space to a lower-dimensional embedding space seeks to preserve either inner products or norms of the distribution in feature space, and which encompasses many commonly used dimension reduction algorithms. We analytically investigate the variational properties of this problem, leading to the following insights: 1) Solutions found using standard particle descent methods may lead to non-deterministic embeddings, 2) A relaxed or probabilistic formulation of the problem admits solutions with easily interpretable necessary conditions, 3) The globally optimal solutions to the relaxed problem actually must give a deterministic embedding. This progression of results mirrors the classical development of optimal transportation, and in a case relating to the Gromov-Wasserstein distance actually gives explicit insight into the structure of the optimal embeddings, which are parametrically determined and discontinuous. Finally, we illustrate that a standard computational implementation of this task does not learn deterministic embeddings, which means that it learns sub-optimal mappings, and that the embeddings learned in that context have highly misleading clustering structure, underscoring the delicate nature of solving this problem computationally.

Paper Structure (9 sections, 15 theorems, 87 equations, 3 figures, 1 table)

Key Result

Proposition 2.2

If the functionals $\mathcal{J}_{\mathbf{IP}}$ and $\mathcal{J}_{\mathbf{N}^2}$ are finite for functions in $L^p(\mathbb{R}^d;\mathbb{R}^m|\mu)$, and $T \equiv 0$ is not the global minimizer, then they are neither convex nor concave on $L^p(\mathbb{R}^d;\mathbb{R}^m|\mu)$.
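
The failure of both convexity and concavity can be seen in a two-point toy problem. The sketch below is an illustration under stated assumptions, not the paper's construction: the cost $(d^2 - |t_1 - t_2|^2)^2$ is an assumed two-point analogue of a norm-preserving functional, and the names `stress` and `midpoint_gap` are hypothetical. A convex function would give a midpoint gap that is never positive, and a concave one a gap that is never negative.

```python
def stress(t1, t2, d=1.0):
    """Assumed two-point norm-type cost: (d^2 - |t1 - t2|^2)^2.

    t1, t2 are the 1-D embedded positions of two points whose
    distance in feature space is d.
    """
    return (d ** 2 - (t1 - t2) ** 2) ** 2


def midpoint_gap(a, b, d=1.0):
    """Return J(midpoint of a and b) minus the average of J(a) and J(b)."""
    mid = tuple((ai + bi) / 2 for ai, bi in zip(a, b))
    return stress(*mid, d) - 0.5 * (stress(*a, d) + stress(*b, d))


# Two mirror-image embeddings with zero cost; their midpoint is the zero map,
# which has strictly positive cost, so the gap is positive (not convex).
convexity_gap = midpoint_gap((0.5, -0.5), (-0.5, 0.5))

# The zero map and an over-stretched embedding; the cost grows quartically,
# so the gap is negative (not concave).
concavity_gap = midpoint_gap((0.0, 0.0), (2.0, -2.0))

print("convexity gap:", convexity_gap)
print("concavity gap:", concavity_gap)
```

The first pair mirrors the hypothesis of the proposition: the zero map is not a minimizer, and averaging two distinct minimizers lands on it.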

Figures (3)

  • Figure 1: An example where standard algorithms find locally optimal solutions which are not maps. Here the position of the points represents the original features in $\mathcal{X} = \mathbb{R}^2$, whereas the color represents the learned embedding in $\mathcal{Y} = \mathbb{R}$. The first graph shows the embedding learned by the implementation of metric MDS in Scikit-learn, and the second graph shows the embedding Scikit-learn finds if given an analytically motivated initial guess. The stress values, normalized by the number of points squared, are also displayed, with a clear improvement in the second image.
  • Figure 2: Each band represents an equivalence class of points in $\mathbb{R}^2$ which all have the same minimizer in $\mathbb{R}$ for the embedding outlined in the paper's example on equivalence classes. Notice that once $|x|>\sqrt{2}$, the line $x_2=0$ is a surface of discontinuity.
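
The normalized stress mentioned in the caption of Figure 1 can be sketched as follows. This is a minimal illustration under assumptions: the function name `normalized_stress` and the toy data are hypothetical, the raw metric-MDS stress is taken to be the sum of squared distance mismatches over unordered pairs, and the exact convention used for the figure (for example whether each pair is counted once or twice) is not stated in this section.

```python
import math
from itertools import combinations


def normalized_stress(points, embedding):
    """Raw metric-MDS stress of a 1-D embedding, divided by n^2.

    points:    list of feature vectors (tuples of floats)
    embedding: list of scalars, one per point
    Assumption: each unordered pair (i, j) is counted once.
    """
    n = len(points)
    total = 0.0
    for i, j in combinations(range(n), 2):
        d_feat = math.dist(points[i], points[j])   # distance in feature space
        d_emb = abs(embedding[i] - embedding[j])   # distance in embedding space
        total += (d_feat - d_emb) ** 2
    return total / n ** 2


# Hypothetical toy data: the four corners of the unit square in R^2.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]

collapsed = [0.0, 0.0, 0.0, 0.0]        # every point sent to the same value
projected = [p[0] for p in points]      # projection onto the first coordinate

print("collapsed:", normalized_stress(points, collapsed))
print("projected:", normalized_stress(points, projected))
```

Comparing the two values gives the same kind of check as the caption: a lower normalized stress indicates that one embedding preserves pairwise distances better than the other on the same data.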

Theorems & Definitions (38)

  • Example 1.1
  • Example 2.1
  • Proposition 2.2
  • proof
  • Example 2.3: Double-well Potential
  • Example 2.4
  • Proposition 2.5
  • Proposition 2.6
  • proof
  • Definition 2.7: Tightness of Embedding Plans
  • ...and 28 more