Clustering, Coding, and the Concept of Similarity
L. Thorne McCarty
TL;DR
This work presents a unified theory of clustering and coding that combines a geometric, Riemannian framework with a probabilistic diffusion model. A potential function $U({\bf x})$ determines an invariant density proportional to $e^{2U({\bf x})}$, while its gradient $\nabla U({\bf x})$ defines a dissimilarity metric $g_{ij}({\bf x})$ on an integral manifold (in the sense of the Frobenius theorem), enabling a low-dimensional encoding that respects the data density. The author develops prototype coding by projecting the diffusion dynamics onto a $k$-dimensional manifold, computes geodesic coordinate curves in the principal directions, and demonstrates the approach on Gaussian and curvilinear Gaussian examples in ${\bf R}^{3}$, including a bimodal case. The paper derives explicit formulas for the diffusion coefficients, connects them to the Laplace-Beltrami operator on the manifold, and shows that the resulting coordinates reflect the distribution of probability mass, with potential relevance to manifold learning and density-aware embedding. Future work aims to extend the method to higher dimensions, to estimate $\nabla U$ from data, and to relate differential similarity to contemporary unsupervised representation learning.
Abstract
This paper develops a theory of clustering and coding which combines a geometric model with a probabilistic model in a principled way. The geometric model is a Riemannian manifold with a Riemannian metric, ${g}_{ij}({\bf x})$, which we interpret as a measure of dissimilarity. The probabilistic model consists of a stochastic process with an invariant probability measure which matches the density of the sample input data. The link between the two models is a potential function, $U({\bf x})$, and its gradient, $\nabla U({\bf x})$. We use the gradient to define the dissimilarity metric, which guarantees that our measure of dissimilarity will depend on the probability measure. Finally, we use the dissimilarity metric to define a coordinate system on the embedded Riemannian manifold, which gives us a low-dimensional encoding of our original data.
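As a minimal illustration of the link between the potential and the invariant measure, the sketch below simulates a one-dimensional diffusion whose drift is $\nabla U$ and checks that its long-run sample variance matches the density proportional to $e^{2U}$. It is not taken from the paper: the unit-variance noise convention $dX = \nabla U(X)\,dt + dB$, the Gaussian choice $U(x) = -x^{2}/(4\sigma^{2})$, the Euler-Maruyama discretization, and the function names `grad_U` and `stationary_variance` are all assumptions made here for illustration.

```python
import math
import random


def grad_U(x, sigma):
    """Gradient of the assumed potential U(x) = -x**2 / (4 * sigma**2).

    With this choice, exp(2 * U(x)) = exp(-x**2 / (2 * sigma**2)),
    an unnormalized Gaussian density with variance sigma**2.
    """
    return -x / (2.0 * sigma ** 2)


def stationary_variance(sigma=1.0, n_paths=1000, n_steps=1000, dt=0.01, seed=0):
    """Estimate the long-run variance of dX = grad_U(X) dt + dB.

    Uses an Euler-Maruyama discretization (an illustrative choice) and
    returns the sample variance across independent paths at the final time.
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(dt)  # standard deviation of a Brownian increment
    xs = [0.0] * n_paths         # every path starts at the mode
    for _ in range(n_steps):
        xs = [
            x + grad_U(x, sigma) * dt + noise_scale * rng.gauss(0.0, 1.0)
            for x in xs
        ]
    mean = sum(xs) / n_paths
    return sum((x - mean) ** 2 for x in xs) / (n_paths - 1)


if __name__ == "__main__":
    for sigma in (1.0, 0.5):
        estimate = stationary_variance(sigma=sigma)
        print(f"sigma^2 = {sigma ** 2:.3f}, simulated variance = {estimate:.3f}")
```

Under these assumptions the simulated variance should be close to $\sigma^{2}$, up to sampling noise and a small discretization bias. A different noise scaling in the stochastic process would change the exponent in the invariant density.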
