Lecture notes on high-dimensional data

Sven-Ake Wegner

TL;DR

This set of notes surveys the core tendencies of high-dimensional data, focusing on concentration phenomena, Gaussian behavior, and distance-preserving reductions. It develops nonasymptotic tools (Bernstein/Chernoff bounds) and applies them to establish strong probabilistic guarantees for norms, distances, and angles of random vectors, both Gaussian and uniform in a ball or cube. It then analyzes random projections (Johnson-Lindenstrauss) as practical dimensionality-reduction devices with explicit bounds on distortion and sample sizes, and finally addresses disentangling two Gaussian clouds in high dimensions, showing that sufficient mean separation yields near-disjoint annuli and reliable separation, both theoretically and experimentally. Together, the notes provide a cohesive, nonasymptotic framework for understanding and manipulating high-dimensional data, with direct implications for clustering, nearest-neighbor methods, and dimensionality reduction.
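The distance-preserving effect of a random projection can be checked numerically. The following is a minimal sketch, not code from the notes: the helper names, the point count, the dimensions `d = 2000` and `k = 500`, and the `1/sqrt(k)` scaling of the Gaussian matrix are all illustrative assumptions.

```python
import numpy as np

def random_projection(points, k, seed=0):
    """Project the rows of `points` from R^d to R^k with a Gaussian matrix.

    The 1/sqrt(k) scaling is an assumption chosen so that squared norms
    are preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    d = points.shape[1]
    U = rng.standard_normal((k, d)) / np.sqrt(k)
    return points @ U.T

def pairwise_distances(points):
    """Return the full matrix of Euclidean distances between rows."""
    diffs = points[:, None, :] - points[None, :, :]
    return np.linalg.norm(diffs, axis=2)

# Hypothetical data: 20 Gaussian points in dimension 2000.
rng = np.random.default_rng(1)
points = rng.standard_normal((20, 2000))
projected = random_projection(points, k=500)

# Compare each pairwise distance after projection with the one before.
iu = np.triu_indices(len(points), k=1)
ratios = pairwise_distances(projected)[iu] / pairwise_distances(points)[iu]
print(f"distance ratios lie in [{ratios.min():.3f}, {ratios.max():.3f}]")
```

With these sizes the ratios should cluster around 1; shrinking `k` widens the spread, which is the trade-off between target dimension and distortion that the Johnson-Lindenstrauss bounds quantify.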

Abstract

These are lecture notes based on the first part of a course on 'Mathematical Data Science', which I taught to final year BSc students in the UK in 2019-2020. Topics include: concentration of measure in high dimensions; Gaussian random vectors in high dimensions; random projections; separation/disentangling of Gaussian data. A revised version has been published as part of the textbook [Mathematical Introduction to Data Science, Springer, Berlin, Heidelberg, 2024, https://link.springer.com/book/10.1007/978-3-662-69426-8].

Paper Structure

This paper contains 6 sections, 36 theorems, 157 equations.

Key Result

Theorem 1.2

Let $X\sim\mathcal{N}(0,1,\mathbb{R}^d)$. Then the expectation $\operatorname{E}(\|X\|-\sqrt{d})$ converges to zero as $d\rightarrow\infty$.
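A quick Monte Carlo experiment illustrates this convergence. The sketch below is not taken from the notes; the function name `mean_norm_gap`, the sample size, the seed, and the chosen dimensions are assumptions made for illustration.

```python
import numpy as np

def mean_norm_gap(d, n_samples=2000, seed=0):
    """Estimate E(||X|| - sqrt(d)) for X standard Gaussian in R^d.

    Draws n_samples independent vectors and averages the gap between
    their Euclidean norm and sqrt(d).
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n_samples, d))
    norms = np.linalg.norm(X, axis=1)
    return float(np.mean(norms - np.sqrt(d)))

# Illustrative dimensions; the empirical gap should shrink as d grows.
gaps = {d: mean_norm_gap(d) for d in (1, 10, 100, 1000)}
for d, gap in gaps.items():
    print(f"d = {d:5d}   mean(||X|| - sqrt(d)) = {gap:+.4f}")
```

The estimates carry sampling noise of order `1/sqrt(n_samples)`, so for large `d` the printed gap may be dominated by that noise rather than by the true expectation.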

Theorems & Definitions (72)

  • Definition 1.1
  • Theorem 1.2
  • proof
  • Theorem 1.3
  • proof
  • Theorem 1.4
  • proof
  • Lemma 2.1
  • proof
  • Theorem 2.2
  • ...and 62 more