Measuring the Data

Ido Cohen

TL;DR

Measuring the Data addresses the challenge of analytically determining the intrinsic dimension $M$ of sparse, nonlinear data. It combines Optimal Transport to generate parametric curves on the data manifold and Koopman Regularization to derive a nonlinear mapping to the intrinsic coordinates, leveraging the fact that the tangent space at a data point is isomorphic to $\mathbb{R}^M$ and that the Koopman eigenfunction space is finite dimensional. The method yields a parsimonious dynamical representation via a minimal set of Koopman eigenfunctions and unit-velocity measurements, enabling data interpolation, compression, denoising, retrieval, and improved neural network interpretability. The results on illustrative examples demonstrate accurate recovery of intrinsic structure and practical utility across multiple data-processing tasks.
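For orientation, the two objects named above can be written out using the standard Koopman definitions. This is a sketch under assumptions: the data are taken to evolve along a curve $x(t)$ with $\dot{x} = P(x)$, and the symbols $P$, $\phi$, $\lambda$, and $m$ are notation introduced here for illustration, not taken from the paper.

$$\frac{d}{dt}\,\phi\big(x(t)\big) = \nabla\phi(x)^{\top} P(x) = \lambda\,\phi(x), \qquad \frac{d}{dt}\,m\big(x(t)\big) = \nabla m(x)^{\top} P(x) = 1.$$

The first relation defines a Koopman eigenfunction $\phi$ with eigenvalue $\lambda$. The second is one natural reading of a unit-velocity measurement $m$: a quantity that advances at rate one along the curve. Under these assumptions, wherever $\phi \neq 0$ and $\lambda \neq 0$, the choice $m = \ln(\phi)/\lambda$ satisfies the second relation, since $\dot{m} = \dot{\phi}/(\lambda\phi) = 1$.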

Abstract

Measuring the Data analytically finds the intrinsic manifold in big data. First, Optimal Transport generates the tangent space at each data point, from which the intrinsic dimension is revealed. Then, the Koopman Dimensionality Reduction procedure derives a nonlinear transformation from the data to the intrinsic manifold. The Measuring the Data procedure is presented here, supported by encouraging results.
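The idea of reading the intrinsic dimension off the tangent space can be illustrated with a small numerical sketch. It does not implement the paper's Optimal Transport step. As a stand-in, it approximates tangent directions by difference vectors to the nearest neighbors and counts the dominant singular values. The embedding `embed`, the function `local_dimension`, the neighbor count `k`, and the threshold `tol` are all hypothetical choices made for this example.

```python
import numpy as np

def embed(u, v):
    """Hypothetical smooth embedding of a 2-D parameter patch into R^5."""
    return np.stack([u, v, np.sin(u), np.cos(v), u * v], axis=-1)

def local_dimension(points, index, k=8, tol=0.1):
    """Estimate the local dimension at points[index].

    Stand-in for the tangent-space step: difference vectors to the k nearest
    neighbors approximate tangent directions, and the number of singular
    values above tol * (largest singular value) is taken as the dimension.
    """
    diffs = points - points[index]
    dist = np.linalg.norm(diffs, axis=1)
    nearest = np.argsort(dist)[1:k + 1]  # skip the point itself (distance 0)
    s = np.linalg.svd(diffs[nearest], compute_uv=False)
    return int(np.sum(s > tol * s[0]))

# Sample the patch [0, 1] x [0, 1] on a 51 x 51 grid (assumed resolution).
grid = np.linspace(0.0, 1.0, 51)
u, v = np.meshgrid(grid, grid)
points = embed(u.ravel(), v.ravel())  # shape (2601, 5)

center = 25 * 51 + 25                 # grid point with u = v = 0.5
estimated_dim = local_dimension(points, center)
print(estimated_dim)
```

The threshold separates first-order (tangent) contributions from the much smaller curvature contributions. It has to be tuned to the sampling density, which is one reason a principled construction of the tangent space matters for sparse data.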

Paper Structure

This paper contains 29 sections, 24 equations, 4 figures, 1 table.

Figures (4)

  • Figure 1: Data Manifold – An illustration of “Measuring the Data”. At each point, we find a set of parametric curves representing data interpolation done by optimal transport. From these curves, we find the tangent bundle at the data points, extract the intrinsic dimension of the data manifold, and then find the intrinsic coordinates. In the image, two zoomed-in neighborhoods illustrate the concepts of parametric curves, tangent bundles, and intrinsic dimension.
  • Figure 2: “Measuring the Data” Flowchart – Step 1: Create parametric curves from a data point to its immediate neighbors with optimal transport. Step 2: From these curves, find the tangent bundle at the data points where the curves intersect. Step 3: Find the intrinsic dimension from the tangent bundle. Step 4: Find the intrinsic coordinates via the differential isomorphism from the data manifold to $\mathbb{R}^M$.
  • Figure 3: Data Set – A set of $35$ sampled Gaussian functions with different means and variances. Each function is sampled at $1001$ uniformly spaced points.
  • Figure 4: Dimensionality Reduction with Measuring the Data – Dimensionality reduction with Koopman Regularization and Measuring the Data. Each Gaussian function (a vector in $\mathbb{R}^{1001}$) is mapped to two-dimensional space without information loss. Each mapped point is shown in the color of its source function. As the mean moves right in the data set, the mapped points move anticlockwise, and the distribution is relative to the distance from the origin.
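A data set of the kind described in Figure 3 can be sketched as follows, together with a quick check that its local tangent space is two-dimensional. Only the counts ($35$ functions, $1001$ samples each) come from the captions. The sampling interval, the $7 \times 5$ grid of means and standard deviations, the unnormalized Gaussian form, and the helper `tangent_rank` are assumptions made for this example. The rank check uses small random perturbations of the parameters, not the paper's Optimal Transport curves.

```python
import numpy as np

def gaussian(t, mean, std):
    """Unnormalized Gaussian profile (normalization is an assumption)."""
    return np.exp(-0.5 * ((t - mean) / std) ** 2)

# Assumed sampling interval; 1001 uniform samples as in the Figure 3 caption.
t = np.linspace(-10.0, 10.0, 1001)

# Assumed parameter grid: 7 means x 5 standard deviations = 35 functions.
means = np.linspace(-3.0, 3.0, 7)
stds = np.linspace(1.0, 2.0, 5)
data = np.array([gaussian(t, m, s) for m in means for s in stds])

def tangent_rank(mean, std, eps=1e-3, tol=0.05, n_samples=20, seed=0):
    """Numerical rank of small difference vectors around one Gaussian.

    Perturbs (mean, std) slightly, stacks the resulting difference vectors,
    and counts singular values above tol * (largest singular value).
    """
    rng = np.random.default_rng(seed)
    base = gaussian(t, mean, std)
    deltas = eps * rng.standard_normal((n_samples, 2))
    diffs = np.array([gaussian(t, mean + dm, std + ds) - base
                      for dm, ds in deltas])
    s = np.linalg.svd(diffs, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

rank = tangent_rank(mean=0.0, std=1.5)
print(data.shape, rank)
```

Each row of `data` is a vector in $\mathbb{R}^{1001}$, yet the perturbation directions span only two dominant dimensions, one for the mean and one for the spread. This is consistent with the two-dimensional mapping described in Figure 4.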