How high is `high'? Rethinking the roles of dimensionality in topological data analysis and manifold learning

Hannah Sansford, Nick Whiteley, Patrick Rubin-Delanchy

TL;DR

The paper reconciles practical data geometry with statistical theory by introducing a generalised Hanson-Wright inequality and distinguishing three notions of dimensionality: ambient intrinsic dimension $p_{\mathrm{int}}$, correlation rank $r$, and latent intrinsic dimension $d$. It develops a random function model that links observed point-clouds $\mathcal{Y}_n$ to latent manifolds $\mathcal{M}$ via Mercer kernels, and establishes persistence-diagram consistency without requiring $p\gg n$. The authors also provide practical diagnostics for isometry between the latent space $\mathcal{Z}$ and the observed geometry $\mathcal{M}$, and present evidence that grid-cell activity encodes a geometrically faithful map of physical space: the toroidal structure appears isometric to physical space under an appropriate model (Model 3). Overall, the work shows that latent topology and manifold structure can emerge once $p_{\mathrm{int}}\gg \log n$, where $n$ is the sample size, and that isometric relations between observed neural activity and real space can be detected in practice.

Abstract

We present a generalised Hanson-Wright inequality and use it to establish new statistical insights into the geometry of data point-clouds. In the setting of a general random function model of data, we clarify the roles played by three notions of dimensionality: ambient intrinsic dimension $p_{\mathrm{int}}$, which measures total variability across orthogonal feature directions; correlation rank, which measures functional complexity across samples; and latent intrinsic dimension, which is the dimension of manifold structure hidden in data. Our analysis shows that in order for persistence diagrams to reveal latent homology and for manifold structure to emerge it is sufficient that $p_{\mathrm{int}}\gg \log n$, where $n$ is the sample size. Informed by these theoretical perspectives, we revisit the ground-breaking neuroscience discovery of toroidal structure in grid-cell activity made by Gardner et al. (Nature, 2022): our findings reveal, for the first time, evidence that this structure is in fact isometric to physical space, meaning that grid cell activity conveys a geometrically faithful representation of the real world.

Paper Structure

This paper contains 36 sections, 11 theorems, 111 equations, 5 figures.

Key Result

Theorem 1

Let $\mathbf{X}=(X_{1},\ldots,X_{p})$ and $\mathbf{X}^{\prime}=(X_{1}^{\prime},\ldots,X_{p}^{\prime})$ be $\mathbb{R}^{p}$-valued random vectors such that the pairs $(X_{j},X_{j}^{\prime})$, $j=1,\ldots,p$ are mutually independent, and $\mathbb{E}[X_{j}]=\mathbb{E}[X_{j}^{\prime}]=0$ and $\|X_{j}\|_

Figures (5)

  • Figure 1: Ambient intrinsic dimension $p_{\mathrm{int}}$, correlation rank $r$, and latent intrinsic dimension $d$ at play in simulation from a toy example of the random function model with $n=1000$. (a)-(c) show SVD visualisation of simulated data with respectively $p_{\mathrm{int}}=3, 8, 20$; as $p_{\mathrm{int}}$ grows, the $d=1$-dimensional manifold $\mathcal{M}=\{\phi(z);z\in\mathcal{Z}\}$ shown in (d) emerges in an $r=3$-dimensional subspace. In this example $\mathcal{M}$ is homeomorphic to the latent space $\mathcal{Z}$, which is a circle. See section \ref{sec:data_model} for details.
  • Figure 2: Re-creating the grid cell analysis of Gardner et al. (2022). (a): UMAP visualisation of the grid cell data $\mathbf{Y}_1,\ldots,\mathbf{Y}_n$. This visualisation suggests the presence of toroidal structure, confirmed by the persistence diagram in (b), indicating Betti numbers $H_0=1$, $H_1=2$, $H_2=1$. In (c), cohomological decoding maps the circular coordinates of the torus to a rhombus. (d) shows how these coordinates correspond to physical space through tessellation of the rhombus. In (a), (c) and (d), points are colored by the first component in the PCA embedding of the data to aid visual recognition of the torus.
  • Figure 3: (a) For 3 possible source locations shown in blue, all other physical locations visited by the rat are colored by the shortest path-length in $\mathcal{Y}_n$ from the source. (b) Left: The tessellated rhombus is plotted atop the physical locations. Two physical locations (blue, green), and the shortest path in $\mathcal{Z}_n$ between them under Model 1 are shown. Middle: Representations of the same source and sink locations under Model 2, in which physical locations are superimposed on the central rhombus in the tessellation. The shortest path in $\mathcal{Z}_n$ under Model 2 is shown. Right: Under Model 3, the distance $d_{\mathcal{Z}}$ allows for 'teleporting' (dashed line). (c) Relationships between shortest path-lengths in $\mathcal{Y}_n$ and in $\mathcal{Z}_n$ for Models 1-3 (left to right). Orange lines show best linear fit and red lines show moving averages (shading for $\pm 1$ s.d.). Strongest evidence of isometry appears for Model 3.
  • Figure 4: Shortest path lengths on $\mathcal{Z}_n$ from Model 3 are calculated by first re-tessellating the $z_i$ from Model 2 into the 8 rhombi surrounding the central rhombus (illustrated above). Shortest paths to the same destination point in each of the rhombi are calculated and the minimum of the 9 distances is taken to be the distance under Model 3. Here, the shortest path is to the point in the rhombus directly above the central rhombus.
  • Figure 5: The top two rows show $z_i$ under Model 4 for three different ratios of torus radii $R/r \in \{1.5, 2, 2.5\}$ (left to right). The bottom row shows distance-distance plots comparing the shortest paths on $\mathcal{Y}_n$ and $\mathcal{Z}_n$ for each torus. Unlike Model 3, these embeddings exhibit a pronounced deviation from linearity, especially at larger distances.
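
The "minimum of the 9 distances" rule described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: it replaces graph shortest paths on $\mathcal{Z}_n$ with straight-line distances, and the lattice vectors `A` and `B` (a unit rhombus with a 60-degree angle) and the function name `model3_distance` are assumptions chosen purely for illustration.

```python
import math
from itertools import product

# Hypothetical lattice vectors spanning a unit rhombus with a 60-degree angle.
# In the paper the rhombus comes from cohomological decoding of the data.
A = (1.0, 0.0)
B = (0.5, math.sqrt(3) / 2)


def model3_distance(u, v, a=A, b=B):
    """Minimum straight-line distance from u to the 9 tiled copies of v.

    Simplification: the source computes shortest paths on Z_n; here each
    path is replaced by a Euclidean segment.
    """
    best = math.inf
    for i, j in product((-1, 0, 1), repeat=2):
        # Copy of v in the rhombus shifted by i*a + j*b
        # (i = j = 0 is the central rhombus itself).
        vx = v[0] + i * a[0] + j * b[0]
        vy = v[1] + i * a[1] + j * b[1]
        best = min(best, math.hypot(u[0] - vx, u[1] - vy))
    return best


# Two points near opposite edges of the central rhombus.
u, v = (0.05, 0.0), (0.95, 0.0)
print(math.dist(u, v))        # distance without wrap-around
print(model3_distance(u, v))  # shorter, because the path crosses the edge
```

For the two points above, the wrapped distance is smaller than the direct one, which corresponds to the 'teleporting' behaviour that the Figure 3 caption attributes to Model 3.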

Theorems & Definitions (17)

  • Theorem 1
  • Proposition 1
  • Theorem 2
  • Proposition 2
  • Lemma 1
  • Proof of theorem \ref{thm:HW_inequality}
  • Proof of proposition \ref{prop:iid}
  • Lemma 2
  • Lemma 3
  • Lemma 4
  • ...and 7 more