Manifold learning: what, how, and why
Marina Meilă, Hanyu Zhang
TL;DR
This survey synthesizes the mathematical and statistical foundations of manifold learning, detailing how neighborhood graphs, local linear approximations, and embedding algorithms reveal low-dimensional manifold structure in high-dimensional data. It contrasts one-shot spectral methods (Isomap, Diffusion Maps, Laplacian Eigenmaps, LTSA) with relaxation-based neighbor embeddings (t-SNE, UMAP), highlighting their guarantees, limitations, and susceptibility to distortions such as the Repeated Eigendirections Problem (REP). The work emphasizes the role of the Laplace-Beltrami operator, intrinsic dimension estimation, and scale selection as core statistical challenges, and discusses practical guidance for applications in statistics and the sciences. Overall, it frames manifold learning as a principled toolkit for visualization, regularization, and scientific discovery, while acknowledging remaining gaps in isometric embedding and robust dimension inference.
Abstract
Manifold learning (ML), also known as non-linear dimension reduction, is a set of methods for finding the low-dimensional structure of data. Dimension reduction for large, high-dimensional data is not merely a way to reduce the data; the new representations and descriptors obtained by ML reveal the geometric shape of high-dimensional point clouds, and allow one to visualize, de-noise, and interpret them. This survey presents the principles underlying ML, the representative methods, and their statistical foundations from a practicing statistician's perspective. It describes the trade-offs involved and what theory tells us about the parameter and algorithmic choices we make in order to obtain reliable conclusions.
