Sailing in high-dimensional spaces: Low-dimensional embeddings through angle preservation
Jonas Fischer, Rong Ma
TL;DR
This paper introduces Mercat, a low-dimensional embedding (LDE) method that preserves angles between triples of points by mapping data to the unit sphere $\mathbb{S}^2$, addressing global distortions common in distance-focused LDEs. It formalizes an angle-based loss $\mathcal{L}(X,Y)$ and develops practical strategies including PCA denoising, angle subsampling, and gradient descent on the sphere. The authors provide theoretical results under a spiked population model showing consistency of spectral angle estimators and highlight the potential bias of naive high-dimensional angle estimates. Empirically, Mercat yields competitive or superior performance on synthetic and real data across angle, distance, and sometimes density preservation, indicating a promising direction for LDE theory and practice.
Abstract
Low-dimensional embeddings (LDEs) of high-dimensional data are ubiquitous in science and engineering. They allow us to quickly understand the main properties of the data, identify outliers and processing errors, and inform the next steps of data analysis. As such, LDEs have to be faithful to the original high-dimensional data, i.e., they should represent the relationships that are encoded in the data at both local and global scales. The current generation of LDE approaches focuses on correctly reconstructing local distances between pairs of samples, often outperforming traditional approaches that aim to preserve all distances. For these approaches, however, global relationships are usually strongly distorted, which is often argued to be an inherent trade-off between local and global structure learning for embeddings. We suggest a new perspective on LDE learning: reconstructing angles between data points. We show that this approach, Mercat, yields good reconstruction across a diverse set of experiments and metrics and preserves structures well across all scales. Compared to existing work, our approach also has a simple formulation, facilitating future theoretical analysis and algorithmic improvements.
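
To make the idea of reconstructing angles concrete, the following is a minimal sketch of how the angle at one vertex of a triple of points could be compared between the original space and an embedding on the unit sphere. It is an illustration under stated assumptions, not the authors' implementation: the exact form of the loss $\mathcal{L}(X,Y)$ is not given in this summary, so the mean squared angle difference used here, and the helper names `euclidean_angle`, `spherical_angle`, and `angle_discrepancy`, are hypothetical.

```python
import numpy as np


def _angle_between(u, v):
    """Angle in radians between two nonzero vectors."""
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip to guard against rounding slightly outside [-1, 1].
    return float(np.arccos(np.clip(cos, -1.0, 1.0)))


def euclidean_angle(a, b, c):
    """Angle at vertex a of the triangle (a, b, c) in Euclidean space."""
    return _angle_between(b - a, c - a)


def spherical_angle(a, b, c):
    """Angle at vertex a between the great-circle arcs a->b and a->c.

    Assumes a, b, c are unit vectors. The arc directions at a are obtained
    by projecting b and c onto the tangent plane of the sphere at a.
    """
    tangent_b = b - np.dot(a, b) * a
    tangent_c = c - np.dot(a, c) * a
    return _angle_between(tangent_b, tangent_c)


def angle_discrepancy(X, Y, triples):
    """Hypothetical stand-in for an angle-based loss (assumption, not the
    paper's definition): mean squared difference between the angle at X[i]
    in the original data and the angle at Y[i] on the sphere, taken over
    the given (i, j, k) index triples."""
    diffs = [
        euclidean_angle(X[i], X[j], X[k]) - spherical_angle(Y[i], Y[j], Y[k])
        for i, j, k in triples
    ]
    return float(np.mean(np.square(diffs)))


# Three points in R^5 forming a right angle at X[0].
X = np.array([
    [0.0, 0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0, 0.0],
])
# Three points on the unit sphere spanning one octant (all angles are pi/2).
Y = np.array([
    [0.0, 0.0, 1.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
])

matched = angle_discrepancy(X, Y, [(0, 1, 2)])     # right angle in both spaces
mismatched = angle_discrepancy(X, Y, [(1, 0, 2)])  # pi/4 in X versus pi/2 in Y
```

In this toy example the angle at the first vertex is preserved, while the angle at the second vertex is not, because a spherical triangle can have three right angles and a planar one cannot. Evaluating only a chosen list of triples, rather than all of them, loosely mirrors the angle subsampling mentioned in the TL;DR, since the number of triples grows cubically with the number of points.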
