Sailing in high-dimensional spaces: Low-dimensional embeddings through angle preservation

Jonas Fischer, Rong Ma

TL;DR

This paper introduces Mercat, a low-dimensional embedding method that preserves angles between triples of points by mapping data to the unit sphere $\mathbb{S}^2$, addressing global distortions common in distance-focused LDEs. It formalizes an angle-based loss $\mathcal{L}(X,Y)$ and develops practical strategies including PCA denoising, angle subsampling, and gradient descent on the sphere. The authors provide theoretical results under a spiked population model showing consistency of spectral angle estimators and highlight the potential bias of naive high-dimensional angle estimates. Empirically, Mercat yields competitive or superior performance on synthetic and real data across angle, distance, and sometimes density preservation, indicating a promising direction for LDE theory and practice.

Abstract

Low-dimensional embeddings (LDEs) of high-dimensional data are ubiquitous in science and engineering. They allow us to quickly understand the main properties of the data, identify outliers and processing errors, and inform the next steps of data analysis. As such, LDEs have to be faithful to the original high-dimensional data, i.e., they should represent the relationships that are encoded in the data, at both local and global scales. The current generation of LDE approaches focuses on correctly reconstructing local distances between pairs of samples, often outperforming traditional approaches that aim to preserve all distances. For these approaches, however, global relationships are usually strongly distorted, which is often argued to reflect an inherent trade-off between local and global structure learning for embeddings. We suggest a new perspective on LDE learning: reconstructing angles between data points. We show that this approach, Mercat, yields good reconstruction across a diverse set of experiments and metrics, and preserves structures well across all scales. Compared to existing work, our approach also has a simple formulation, facilitating future theoretical analysis and algorithmic improvements.

Paper Structure

This paper contains 26 sections, 3 theorems, 36 equations, 11 figures, 3 tables, 1 algorithm.

Key Result

Theorem 1

Suppose that $\sigma_1\ge\dots\ge\sigma_r\ge 1+c$ for some constant $c>0$. Then for any $i,j\in\{1,2,\dots,n\}$, we have that, as $(n,d)\to\infty$, for any small constant $\epsilon>0$, where we denote $\mathbf{u}_k=(u_{k1},\dots,u_{kn})$, for $1\le k\le r$.

Figures (11)

  • Figure 1: Visual abstract. Existing work (top) optimizes low-dimensional embeddings to reconstruct distances, focusing on the reconstruction of local structures (smaller distances), which distorts or breaks global structures (larger distances). We suggest (bottom) reconstructing angles between any three points, embedding on the 2D sphere $\mathbb{S}^2$, which captures structures at any scale.
  • Figure 2: Embeddings of low-dimensional examples. We visualize the Smiley (top), Mammoth (middle), and Circle (bottom) data and computed embeddings.
  • Figure 3: Computing sphere angles with linear algebra. We visualize the idea of computing the angle $\alpha$ between two (geodesic) paths $\overline{AB}, \overline{AC}$ on a sphere. The key insight is that the angle between the two geodesics is the same as the angle between the normals (visualized as arrows) of the two triangles $\Delta OAB$, $\Delta OAC$ in the ambient 3D space, with $O$ as center of the sphere.
  • Figure 4: Spectral analysis of angle space. For 500 samples randomly taken from human hematopoiesis data paulhema, we show (a) the singular values of the matrix $\Theta_i$ of cosine-angles at sample $i$ (one line per sample) and (b) the distribution of the effective rank of all $\Theta_i$ on this dataset. Angle matrices are of low (effective) rank, which encourages subsampling of angles.
  • Figure 5: Embeddings for Mammoth with varying neighborhood size. Visualizations of the Mammoth dataset produced by existing methods under various neighborhood parameter settings, using neighborhood/perplexity values of $\theta\in\{10,20,50,100,200\}$.
  • ...and 6 more figures
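
The construction described in the Figure 3 caption can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function name `sphere_angle` and the example points are assumptions made here. It computes the angle at $A$ between the geodesics $\overline{AB}$ and $\overline{AC}$ as the angle between the normals $A\times B$ and $A\times C$ of the planes through the sphere center $O$.

```python
import numpy as np

def sphere_angle(a, b, c):
    """Angle at a between the geodesics a->b and a->c on a sphere centered at the origin.

    Hypothetical helper for illustration. Assumes b and c are neither equal
    nor antipodal to a, so both plane normals are nonzero.
    """
    n_ab = np.cross(a, b)  # normal of the plane through O, a, b
    n_ac = np.cross(a, c)  # normal of the plane through O, a, c
    cos_alpha = np.dot(n_ab, n_ac) / (np.linalg.norm(n_ab) * np.linalg.norm(n_ac))
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.arccos(np.clip(cos_alpha, -1.0, 1.0))

# Example: the octant triangle with the north pole as vertex has a right angle there.
A = np.array([0.0, 0.0, 1.0])
B = np.array([1.0, 0.0, 0.0])
C = np.array([0.0, 1.0, 0.0])
alpha = sphere_angle(A, B, C)
```

Because only cross and dot products are involved, the same computation vectorizes over many triples, which is convenient when angles are subsampled.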

Theorems & Definitions (3)

  • Theorem 1: Guarantee of spectral angle estimators
  • Theorem 2: Limitation of naive angle estimators
  • Lemma 1
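
The contrast between Theorems 1 and 2 can be illustrated with a small simulation. This sketch rests on assumptions that are not taken from the paper: a rank-2 Gaussian signal plus isotropic unit-variance noise, the sizes `n`, `d`, and `signal_scale`, and plain PCA scores used as a stand-in for a spectral estimator (it is not the estimator analyzed in Theorem 1). It compares angles computed directly in the noisy high-dimensional space with angles computed after projecting onto the top principal components.

```python
import numpy as np

def vertex_angles(P, triples):
    """Angle at point i between points j and k, for each row (i, j, k) of triples."""
    i, j, k = triples.T
    a = P[j] - P[i]
    b = P[k] - P[i]
    cos = np.sum(a * b, axis=1) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1))
    return np.arccos(np.clip(cos, -1.0, 1.0))

def compare_angle_estimates(n=200, d=2000, r=2, signal_scale=20.0, n_triples=300, seed=0):
    """Return mean absolute angle error (radians) of naive and PCA-based estimates.

    All parameters are illustrative assumptions, not values from the paper.
    """
    rng = np.random.default_rng(seed)
    S = signal_scale * rng.standard_normal((n, r))   # low-dimensional signal
    X = rng.standard_normal((n, d))                  # isotropic noise in d dimensions
    X[:, :r] += S                                    # signal lives in the first r coordinates
    triples = np.array([rng.choice(n, size=3, replace=False) for _ in range(n_triples)])

    true = vertex_angles(S, triples)
    naive = vertex_angles(X, triples)                # angles measured on the noisy data

    Xc = X - X.mean(axis=0)                          # center before PCA
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    Y = U[:, :r] * s[:r]                             # top-r PCA scores
    pca = vertex_angles(Y, triples)

    return np.mean(np.abs(naive - true)), np.mean(np.abs(pca - true))

err_naive, err_pca = compare_angle_estimates()
```

Under these assumptions the noise contributes a shared term to every inner product of difference vectors, which pulls the naive angles toward a common value, so `err_naive` should come out noticeably larger than `err_pca`.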