Lecture notes on high-dimensional data
Sven-Ake Wegner
TL;DR
These notes survey the typical behavior of high-dimensional data, focusing on concentration phenomena, Gaussian random vectors, and distance-preserving dimensionality reduction. They develop nonasymptotic tools (Bernstein/Chernoff bounds) and apply them to establish probabilistic guarantees for norms, distances, and angles of random vectors, both Gaussian and uniformly distributed in a ball or cube. They then analyze random projections (Johnson-Lindenstrauss) as practical dimensionality-reduction devices, with explicit bounds on distortion and sample sizes. Finally, they address the problem of disentangling two Gaussian clouds in high dimensions, showing in theory and in experiments that, when the means are sufficiently far apart, the two clouds concentrate in nearly disjoint annuli and can be separated reliably. Together, the notes provide a cohesive, nonasymptotic framework for understanding and manipulating high-dimensional data, with direct implications for clustering, nearest-neighbor methods, and dimensionality reduction.
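As a rough numerical illustration of two of the phenomena named above, the following sketch draws a few standard Gaussian vectors, checks that their norms are close to the square root of the dimension, and then applies a Gaussian random projection to see how well pairwise distances survive. It is only a sketch under stated assumptions: the dimensions `d` and `k`, the number of points `n`, the seed, and the helper names are illustrative choices, and the `1/sqrt(k)` scaling of the projection is one common convention rather than necessarily the one used in the notes.

```python
import math
import random


def gaussian_vector(dim, rng):
    """Draw a vector with independent N(0, 1) coordinates."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def norm(x):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(t * t for t in x))


def dist(x, y):
    """Euclidean distance between two vectors of equal length."""
    return norm([a - b for a, b in zip(x, y)])


def random_projection(x, matrix):
    """Map x from R^d to R^k with a Gaussian matrix, scaled by 1/sqrt(k).

    The scaling is an assumed convention; it makes projected lengths
    comparable to the original lengths on average.
    """
    k = len(matrix)
    return [sum(r * t for r, t in zip(row, x)) / math.sqrt(k) for row in matrix]


# Illustrative parameters (assumptions, not taken from the notes).
rng = random.Random(0)
d, k, n = 500, 100, 10

points = [gaussian_vector(d, rng) for _ in range(n)]

# Norm concentration: ||x|| / sqrt(d) should be close to 1 for every sample.
ratios = [norm(x) / math.sqrt(d) for x in points]

# Random projection: a k x d matrix with independent N(0, 1) entries.
matrix = [gaussian_vector(d, rng) for _ in range(k)]
projected = [random_projection(x, matrix) for x in points]

# Ratio of projected to original distance for every pair of points.
distortions = [
    dist(projected[i], projected[j]) / dist(points[i], points[j])
    for i in range(n)
    for j in range(i + 1, n)
]

print("norm / sqrt(d):   min %.3f, max %.3f" % (min(ratios), max(ratios)))
print("distance ratios:  min %.3f, max %.3f" % (min(distortions), max(distortions)))
```

Both printed ranges should sit in a band around 1. Increasing `d` tightens the first range, and increasing `k` tightens the second.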
Abstract
These are lecture notes based on the first part of a course on 'Mathematical Data Science', which I taught to final-year BSc students in the UK in 2019-2020. Topics include: concentration of measure in high dimensions; Gaussian random vectors in high dimensions; random projections; separation/disentangling of Gaussian data. A revised version has been published as part of the textbook [Mathematical Introduction to Data Science, Springer, Berlin, Heidelberg, 2024, https://link.springer.com/book/10.1007/978-3-662-69426-8].
