Clustering, Coding, and the Concept of Similarity

L. Thorne McCarty

TL;DR

This work presents a unified theory of clustering and coding that combines a geometric Riemannian framework with a probabilistic diffusion model. A potential function $U({\bf x})$ determines an invariant density proportional to $e^{2U({\bf x})}$, while its gradient $\nabla U({\bf x})$ defines a dissimilarity metric $g_{ij}({\bf x})$ on an integral manifold (in the sense of Frobenius), enabling a low-dimensional encoding that respects data density. The paper develops prototype coding by projecting the diffusion dynamics onto a $k$-dimensional manifold, computes geodesic coordinate curves in the principal directions, and demonstrates the approach on Gaussian and curvilinear Gaussian experiments in ${\bf R}^{3}$, including a bimodal case. It derives explicit formulas for the diffusion coefficients, connects them to the Laplace-Beltrami operator on the manifold, and shows that the resulting coordinates reflect the probability mass, with potential relevance to manifold learning and density-aware embedding. Future work aims to extend the method to higher dimensions, estimate $\nabla U$ from data, and relate differential similarity to contemporary unsupervised representation learning.

Abstract

This paper develops a theory of clustering and coding which combines a geometric model with a probabilistic model in a principled way. The geometric model is a Riemannian manifold with a Riemannian metric, ${g}_{ij}({\bf x})$, which we interpret as a measure of dissimilarity. The probabilistic model consists of a stochastic process with an invariant probability measure which matches the density of the sample input data. The link between the two models is a potential function, $U({\bf x})$, and its gradient, $\nabla U({\bf x})$. We use the gradient to define the dissimilarity metric, which guarantees that our measure of dissimilarity will depend on the probability measure. Finally, we use the dissimilarity metric to define a coordinate system on the embedded Riemannian manifold, which gives us a low-dimensional encoding of our original data.
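The link between the potential and the invariant measure can be checked numerically in one dimension. The sketch below is an illustration, not the paper's own construction: it assumes the stochastic process has the form $dX = \nabla U(X)\,dt + dW$, a standard diffusion whose invariant density is proportional to $e^{2U}$, and it uses a hypothetical quadratic potential $U(x) = -x^{2}/2$. Under those assumptions $e^{2U} = e^{-x^{2}}$ is a Gaussian with variance $1/2$, so a long Euler-Maruyama trajectory should have an empirical variance close to $0.5$. The names `grad_U`, `simulate`, and `variance` are invented for this example.

```python
import math
import random


def grad_U(x, a=1.0):
    """Gradient of the hypothetical quadratic potential U(x) = -a * x**2 / 2."""
    return -a * x


def simulate(n_steps=200_000, dt=0.01, burn_in=1_000, seed=0):
    """Euler-Maruyama trajectory of the assumed SDE dX = grad_U(X) dt + dW.

    Returns the n_steps states visited after discarding burn_in initial steps.
    """
    rng = random.Random(seed)
    sqrt_dt = math.sqrt(dt)
    x = 0.0
    samples = []
    for step in range(n_steps + burn_in):
        # Drift along the gradient of U plus a Brownian increment of variance dt.
        x += grad_U(x) * dt + sqrt_dt * rng.gauss(0.0, 1.0)
        if step >= burn_in:
            samples.append(x)
    return samples


def variance(samples):
    """Population variance of a list of floats."""
    mean = sum(samples) / len(samples)
    return sum((s - mean) ** 2 for s in samples) / len(samples)


if __name__ == "__main__":
    xs = simulate()
    print(f"empirical variance: {variance(xs):.3f} (e^(2U) predicts 0.5)")
```

The time step introduces a small discretization bias, so the empirical variance approaches $0.5$ only approximately; reducing `dt` and lengthening the run tightens the agreement.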

Paper Structure

This paper contains 13 sections, 10 theorems, 107 equations, 16 figures.

Key Result

Lemma 1

$w(t,{\bf x})$ is a solution to the Cauchy problem for $w$ (wCauchy) if and only if $e^{U({\bf x})}w(t,{\bf x})$ is a solution to the Cauchy problem for $u$ (uCauchy) with initial value $u(0,\cdot) = e^{U}f$.

Figures (16)

  • Figure 1: Contour plot for the surface of a quadratic potential at $U(x,y,z) = -2$.
  • Figure 2: The gradient vector field at $z = 0$ for the quadratic potential in Figure 1.
  • Figure 3: An integral manifold with a global coordinate system for the quadratic potential in Figure 1.
  • Figure 4: A coordinate system for the quadratic potential in Figure 1, based on commutative flows.
  • Figure 5: A coordinate system for the $\rho,\theta$ surface of the quadratic potential in Figure 1.
  • ...and 11 more figures

Theorems & Definitions (21)

  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Theorem 2
  • proof
  • Lemma 2
  • proof
  • Theorem 3
  • proof
  • ...and 11 more