Geometric Data Science

Olga D Anosova, Vitaliy A Kurlin

TL;DR

This work formulates Geometric Data Science as a rigorous framework to compare real objects by moduli spaces under practical equivalences, anchoring analysis in invariants and metrics with guaranteed continuity and polynomial-time computability. It develops complete, Lipschitz-continuous invariants for finite clouds (BRI, PCI, WMI, PDD/SDD/SCD) and extends these ideas to periodic objects (1-periodic sequences, lattices, and density functions), delivering hierarchical invariants and efficient comparison algorithms. A key contribution is the construction of geomaps and moduli-space embeddings (e.g., RI/PI spaces, SLM spherical mapping) that enable robust, geodesic-style navigation of object universes, including biological macromolecules and crystalline materials. The framework unifies classical invariants (distance matrices, Gram matrices) with modern metric geometry, enabling fast, scalable detection of duplicates, isometry-invariant classification, and continuous measures of chirality and symmetry across both finite and periodic data. These advances have practical implications for materials discovery, protein structure analysis, and crystallography, providing tools to systematically explore and compare vast geometric datasets.

Abstract

This book introduces the new research area of Geometric Data Science, where data can represent any real objects through geometric measurements. The first part of the book focuses on finite point sets. The most important result is a complete and continuous classification of all finite clouds of unordered points under rigid motion in any Euclidean space. The key challenge was to avoid the exponential complexity arising from permutations of the given unordered points. For a fixed dimension of the ambient Euclidean space, the times of all algorithms for the resulting invariants and distance metrics depend polynomially on the number of points. The second part of the book advances a similar classification in the much more difficult case of periodic point sets, which model all periodic crystals at the atomic scale. The most significant result is the hierarchy of invariants from the ultra-fast to complete ones. The key challenge was to resolve the discontinuity of crystal representations that break down under almost any noise. Experimental validation on all major materials databases confirmed the Crystal Isometry Principle: any real periodic crystal has a unique location in a common moduli space of all periodic structures under rigid motion. The resulting moduli space contains all known and not yet discovered periodic crystals and hence continuously extends Mendeleev's table to the full crystal universe.

Paper Structure

This paper contains 74 sections, 104 theorems, 85 equations, 88 figures, 29 tables.

Key Result

Lemma 2.1.3

(a) A symmetric $m\times m$ matrix of $s_{ij}\geq 0$ with $s_{ii}=0$ is realisable as a matrix of squared distances between $p_0=0,p_1,\dots,p_{m-1}\in\mathbb{R}^n$ for some $n$ if and only if the $(m-1)\times(m-1)$ matrix $G$ of $g_{ij}=\dfrac{s_{0i}+s_{0j}-s_{ij}}{2}$ has only non-negative eigenval
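The statement above is cut off, so the following is only a minimal sketch. It assumes the condition is the classical one: the matrix $G$ must be positive semidefinite, which means all its eigenvalues are non-negative. The function names (`gram_from_squared_distances`, `is_positive_semidefinite`, `is_realisable`) and the tolerance `tol` are illustrative choices and do not come from the book. To stay within the Python standard library, the sketch tests positive semidefiniteness by diagonally pivoted elimination instead of computing eigenvalues.

```python
def gram_from_squared_distances(s):
    """Build G with g_ij = (s_0i + s_0j - s_ij) / 2 for i, j = 1..m-1."""
    m = len(s)
    return [[(s[0][i] + s[0][j] - s[i][j]) / 2.0 for j in range(1, m)]
            for i in range(1, m)]


def is_positive_semidefinite(g, tol=1e-9):
    """Check that a symmetric matrix is PSD up to an absolute tolerance.

    Uses elimination with the largest remaining diagonal entry as pivot
    (a swapped-in technique, not an eigenvalue computation).
    """
    a = [row[:] for row in g]
    idx = list(range(len(a)))  # rows/columns not yet eliminated
    while idx:
        p = max(idx, key=lambda i: a[i][i])
        if a[p][p] < -tol:
            return False  # a negative pivot rules out PSD
        if a[p][p] <= tol:
            # Zero pivot: a PSD remainder must vanish entirely.
            return all(abs(a[i][j]) <= tol for i in idx for j in idx)
        idx.remove(p)
        for i in idx:
            f = a[i][p] / a[p][p]
            for j in idx:
                a[i][j] -= f * a[p][j]  # Schur complement update
    return True


def is_realisable(s, tol=1e-9):
    """True if s can be a matrix of squared Euclidean distances."""
    return is_positive_semidefinite(gram_from_squared_distances(s), tol)


# Squared distances of the right triangle with sides 3, 4, 5.
triangle = [[0, 9, 16], [9, 0, 25], [16, 25, 0]]
# "Distances" 1, 1, 3 violate the triangle inequality.
broken = [[0, 1, 1], [1, 0, 9], [1, 9, 0]]
# Three collinear points 0, 1, 2: G is PSD of rank 1.
collinear = [[0, 1, 4], [1, 0, 1], [4, 1, 0]]

print(is_realisable(triangle), is_realisable(broken), is_realisable(collinear))
```

The absolute tolerance suits inputs of moderate size; for large or badly scaled distances a relative tolerance would be a more careful choice.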

Figures (88)

  • Figure 1: The main questions of Geometric Data Science are illustrated for molecules: H2O, CO2, CH4.
  • Figure 2: Left: in the Euclidean line $\mathbb{R}$, the clouds $A$ of 4 green points and $B$ of 4 blue points have a small Hausdorff distance $\mathrm{HD}$. Right: the same clouds $A,B\subset\mathbb{R}$ have a large bottleneck distance $\mathrm{BD}$ based on a bijection $g:A\to B$ (shown by red arrows), which minimises the maximum deviation of points, see Example \ref{exa:metrics}(b).
  • Figure 3: Left: a geocode $I$ from Problem \ref{pro:geocodes} is illustrated for triangles (3-point clouds) whose isometry classes form a moduli space, which can be mapped like Earth. Right: the Cloud Isometry Space $\mathrm{CIS}(\mathbb{R}^n;3)$ is continuously parametrised by triples of inter-point distances $0<a\leq b\leq c\leq a+b$.
  • Figure 4: Left: the key concepts are introduced in Definitions \ref{dfn:equivalence}, \ref{dfn:invariants}, \ref{dfn:metrics}, \ref{dfn:Lipschitz}, and \ref{dfn:complexity}, all linked in Problem \ref{pro:geocodes}. Right: the main objects are finite and periodic sets of unordered points, including lattices in $\mathbb{R}^2$ whose space under rigid motion was the first solution to Problem \ref{pro:geocodes}.
  • Figure 5: Left: all main atoms $N_i$, $A_i$, $C_i$ of a protein chain form a backbone embedded in $\mathbb{R}^3$. Middle: each triangle $\triangle N_i A_i C_i$ defines an orthonormal basis $\vb*{u}_i,\vb*{v}_i,\vb*{w}_i$. The coordinates of the bonds $\overrightarrow{C_i N_{i+1}}$, $\overrightarrow{N_{i+1}A_{i+1}}$, $\overrightarrow{A_{i+1}C_{i+1}}$ in this basis form the complete Backbone Rigid Invariant $\mathrm{BRI}$. Right: all rigidly equivalent backbones form a single rigid class. All rigid classes form the Backbone Rigid Space. The image schematically illustrates four different classes of simple polygonal chains in $\mathbb{R}^3$.
  • ...and 83 more figures
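
The contrast in the Figure 2 caption between a small Hausdorff distance and a large bottleneck distance can be reproduced with a short sketch. The coordinates of `A` and `B` below are hypothetical, since the caption gives no point positions, and the function names are illustrative. The bottleneck distance is computed by brute force over all bijections. This is the exponential cost of permutations that the abstract says the book's invariants are designed to avoid, so it is only suitable for tiny clouds.

```python
from itertools import permutations


def hausdorff_distance(a, b):
    """Hausdorff distance between two finite clouds on the real line."""
    def directed(x, y):
        # Largest distance from a point of x to its nearest point of y.
        return max(min(abs(p - q) for q in y) for p in x)
    return max(directed(a, b), directed(b, a))


def bottleneck_distance(a, b):
    """Minimum over bijections g: a -> b of the maximum |p - g(p)|."""
    if len(a) != len(b):
        raise ValueError("a bijection needs clouds of equal size")
    return min(
        max(abs(p - q) for p, q in zip(a, perm))
        for perm in permutations(b)
    )


# Hypothetical clouds: A has three points near 0, B has three near 100.
A = [0, 1, 2, 100]
B = [0, 98, 99, 100]

print(hausdorff_distance(A, B))   # 2
print(bottleneck_distance(A, B))  # 97
```

Every point of each cloud lies within 2 of the other cloud, so the Hausdorff distance is small. Any bijection must send at least two of the points near 0 to points near 100, which forces a large bottleneck distance.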

Theorems & Definitions (282)

  • Definition 1.2.1: equivalence relation
  • Example 1.2.2: non-equivalences
  • Example 1.2.3: rigid motion, isometry, dilation, and homothety
  • Definition 1.2.4: weaker vs stronger equivalences
  • Definition 1.2.5: invariants and complete invariants
  • Example 1.2.6: invariants vs non-invariants
  • Definition 1.3.1: metrics and pseudo-metrics
  • Definition 1.3.2: metric spaces and clouds
  • Example 1.3.3: Minkowski metrics, Hausdorff and bottleneck distances
  • Definition 1.3.4: Lipschitz continuity
  • ...and 272 more