Noncommutative Model Selection for Data Clustering and Dimension Reduction Using Relative von Neumann Entropy
Araceli Guzmán-Tristán, Antonio Rieser
TL;DR
The paper addresses parameter-free unsupervised clustering and dimension reduction for data sampled from a metric space. It constructs a family of graphs $G_r$ whose edge weights are given by ambient distances and selects the scale $\hat{r}$ that maximizes a noncommutative information measure: the relative von Neumann entropy between short-time and long-time heat operators on the graph. Once the graph is selected, the kernel of its Laplacian identifies the clusters, and the eigenvectors of the Laplacian provide a low-dimensional embedding. Key contributions include the ambient-distance graphs, the entropy-based model selection criterion, and experiments on simulated three-dimensional data and COIL-20 images, in which the clustering algorithm outperforms $k$-means on data sets with non-trivial geometry and topology and the dimension reduction algorithm works well on several simple examples. The result is a fully data-driven spectral framework that needs neither a neighborhood size nor a number of clusters as input, with potential applicability to geometrically complex data in other domains.
Abstract
We propose a pair of completely data-driven algorithms for unsupervised classification and dimension reduction, and we empirically study their performance on a number of data sets, both simulated data in three dimensions and images from the COIL-20 data set. The algorithms take as input a set of points sampled from a uniform distribution supported on a metric space, the latter embedded in an ambient metric space, and they output a clustering or reduction of dimension of the data. They work by constructing a natural family of graphs from the data and selecting the graph which maximizes the relative von Neumann entropy of certain normalized heat operators constructed from the graphs. Once the appropriate graph is selected, the eigenvectors of the graph Laplacian may be used to reduce the dimension of the data, and clusters in the data may be identified with the kernel of the associated graph Laplacian. Notably, these algorithms do not require information about the size of a neighborhood or the desired number of clusters as input, in contrast to popular algorithms such as $k$-means, and even more modern spectral methods such as Laplacian eigenmaps, among others. In our computational experiments, our clustering algorithm outperforms $k$-means clustering on data sets with non-trivial geometry and topology, in particular data whose clusters are not concentrated around a specific point, and our dimension reduction algorithm is shown to work well in several simple examples.
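The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under several assumptions that the abstract does not fix, and it should not be read as the authors' implementation: the graph at scale `r` joins points whose ambient distance is at most `r` and weights each edge by that distance; the Laplacian is the unnormalized one; the normalized heat operators are taken to be $e^{-tL}/\operatorname{Tr}(e^{-tL})$ at two hypothetical times `t_short` and `t_long`; and the candidate `radii` are supplied by the caller. Because both operators here are functions of the same Laplacian, they commute, so the relative von Neumann entropy reduces to a Kullback-Leibler divergence between the two normalized spectra. All function names are illustrative.

```python
import numpy as np
from scipy.linalg import eigvalsh
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import pdist, squareform
from scipy.special import logsumexp


def weight_matrix(points, r):
    """Assumed graph G_r: edge i~j when dist(i, j) <= r, weighted by the distance."""
    dist = squareform(pdist(points))
    # The diagonal is zero, so no self-loops are created.
    return np.where(dist <= r, dist, 0.0)


def laplacian(weights):
    """Unnormalized graph Laplacian L = D - W (an assumption of this sketch)."""
    return np.diag(weights.sum(axis=1)) - weights


def relative_entropy(lap, t_short, t_long):
    """S(rho_short || rho_long) for rho_t = exp(-t L) / Tr exp(-t L).

    Both operators share the eigenvectors of L, so the trace formula
    Tr rho (log rho - log sigma) becomes a sum over eigenvalues.
    """
    evals = eigvalsh(lap)
    log_p = -t_short * evals - logsumexp(-t_short * evals)
    log_q = -t_long * evals - logsumexp(-t_long * evals)
    return float(np.sum(np.exp(log_p) * (log_p - log_q)))


def select_scale(points, radii, t_short=0.1, t_long=10.0):
    """Return the candidate radius with the largest relative entropy."""
    scores = [
        relative_entropy(laplacian(weight_matrix(points, r)), t_short, t_long)
        for r in radii
    ]
    return radii[int(np.argmax(scores))]


def cluster(points, r):
    """Clusters as connected components of G_r.

    The kernel of the graph Laplacian is spanned by the indicator vectors
    of the connected components, so the two descriptions agree.
    """
    return connected_components(weight_matrix(points, r), directed=False)
```

A caller would typically pick `r_hat = select_scale(points, radii)` and then call `cluster(points, r_hat)`, or take the leading eigenvectors of `laplacian(weight_matrix(points, r_hat))` as embedding coordinates. The default times and the choice of candidate radii are placeholders, and the selected scale depends on them.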
