
The bliss of dimensionality: how an unsupervised criterion identifies optimal low-resolution representations of high-dimensional datasets

Margherita Mele, Daniel Campos Moreno, Raffaello Potestio

Abstract

Selecting the optimal resolution for discretizing high-dimensional data is a central problem in physics and data analysis, particularly in unsupervised settings where the underlying distribution is unknown. The Relevance-Resolution (Res-Rel) framework addresses this issue through an information-theoretic trade-off between descriptive detail and statistical reliability. Here we provide a systematic validation of this approach by comparing its two characteristic optima, the maximum-relevance point and the (information-theoretic) -1 slope point, with the discretization that minimizes the Kullback-Leibler (KL) divergence from a known or physically motivated ground-truth distribution. Across unstructured and structured synthetic datasets, Gaussian clones of MNIST, and molecular dynamics simulations of the alanine dipeptide, we find that, as the dimensionality or informative content increases, the KL-optimal discretization consistently lies within the Res-Rel optimality region. Furthermore, in high-dimensional regimes the -1 slope criterion closely matches the KL divergence minimum. These results establish the quantitative consistency of unsupervised information-theoretic selection with distribution-based optimality.
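To make the two characteristic optima concrete, the following is a minimal sketch of how resolution and relevance are commonly computed from a discretized sample. It assumes the definitions that are standard in the Res-Rel literature (resolution as the entropy of the state labels, relevance as the entropy of the state frequencies); the abstract itself does not spell them out. The function names, and the choice to locate the -1 slope point as the maximum of resolution plus relevance, are illustrative assumptions and not necessarily the authors' procedure.

```python
from collections import Counter
import math


def resolution_relevance(labels):
    """Return (resolution, relevance) in nats for a sequence of discrete state labels.

    Assumed (standard) definitions:
      resolution = -sum_s (k_s / M) * log(k_s / M)
      relevance  = -sum_k (k * m_k / M) * log(k * m_k / M)
    where k_s is the number of samples in state s, m_k is the number of
    states containing exactly k samples, and M is the sample size.
    """
    M = len(labels)
    counts = Counter(labels)                 # k_s for every occupied state
    multiplicity = Counter(counts.values())  # m_k for every observed k
    resolution = -sum((k / M) * math.log(k / M) for k in counts.values())
    relevance = -sum(
        (k * m / M) * math.log(k * m / M) for k, m in multiplicity.items()
    )
    return resolution, relevance


def characteristic_points(curve):
    """Pick the two optima from a curve of (n_states, resolution, relevance).

    Assumption: on a concave curve, the -1 slope point coincides with the
    maximum of resolution + relevance, so it is located that way here.
    """
    n_mr = max(curve, key=lambda t: t[2])[0]         # maximum relevance
    n_it = max(curve, key=lambda t: t[1] + t[2])[0]  # -1 slope point
    return n_mr, n_it
```

In use, one would cluster or bin the data at a range of resolutions, call `resolution_relevance` on each labelling, and pass the resulting triples to `characteristic_points`.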

Paper Structure (4 sections, 2 equations, 4 figures)


Figures (4)

  • Figure 1: Comparison between Relevance--Resolution and Kullback--Leibler optimal representations for unstructured synthetic data. Ratio $n_{\mathrm{KL}}/n_{\mathrm{opt}}$ as a function of the data dimensionality $N$ for unstructured synthetic datasets. In each panel, black squares correspond to $n_{\mathrm{opt}}^{\mathrm{MR}}$ (maximum relevance), while grey circles correspond to $n_{\mathrm{opt}}^{\mathrm{IT}}$ ($-1$ slope point). The red dashed line indicates equality between the two estimates. Error bars represent the standard deviation over $50$ independent realisations. Panel (a) shows low-dimensional datasets analysed using $N$-dimensional histograms, generated from Gaussian, Beta, Exponential, and correlated Gaussian distributions. Panel (b) shows the corresponding analysis for IID Gaussian data spanning from low to high dimensions, where representations are constructed using UPGMA clustering.
  • Figure 2: Comparison between Relevance--Resolution and Kullback--Leibler optimal representations for structured synthetic data. Ratio $n_{\mathrm{KL}}/n_{\mathrm{opt}}$ as a function of the number of informative dimensions $m$ for structured synthetic datasets with latent Gaussian mixture structure. In each panel, black squares correspond to $n_{\mathrm{opt}}^{\mathrm{MR}}$ (maximum relevance), while grey circles correspond to $n_{\mathrm{opt}}^{\mathrm{IT}}$ ($-1$ slope point). The red dashed line indicates equality between the two estimates. Error bars represent the standard deviation over $50$ independent realisations. Panels (a) and (b) show equal-weight mixtures for $K=2$ and $K=5$, while panels (c,d) show the corresponding unequal-weight cases with weights $[0.66,0.33]$ and $[0.34,0.27,0.19,0.13,0.07]$, respectively. Different rows correspond to increasing values of the mixture standard deviation $\sigma_{\mathrm{M}}$, as indicated. For each row, the right-hand subpanels show two-dimensional projections of representative datasets generated with $m=2$ for the corresponding parameter values. For each value of $K$, the random seed used to generate the data is fixed across different scenarios, ensuring that the relative positions of the mixture means are identical.
  • Figure 3: Comparison between Relevance--Resolution and Kullback--Leibler optimal representations for Gaussian clones of MNIST. Boxplots show the ratio $n_{\mathrm{KL}} / n_{\mathrm{opt}}$ evaluated at two characteristic points of the Relevance--Resolution curve: the $-1$-slope point $n_{\mathrm{opt}}^{\mathrm{IT}}$ (information-theoretic optimum) and the maximum-relevance point $n_{\mathrm{opt}}^{\mathrm{MR}}$. The horizontal red dashed line marks the ideal value $n_{\mathrm{KL}} / n_{\mathrm{opt}} = 1$, corresponding to perfect agreement between the two criteria, while blue horizontal lines indicate the median of each distribution. Panels (a,b) correspond to mixtures with $K = 2$ components, with equal weights in (a) and unequal weights $[0.66,\,0.33]$ in (b). Panels (c,d) show mixtures with $K = 5$ components, with equal weights in (c) and unequal weights $[0.34,\,0.27,\,0.19,\,0.13,\,0.07]$ in (d).
  • Figure 4: Comparison between Relevance--Resolution and Kullback--Leibler optimal representations for alanine dipeptide. (a) Ratio $n_{\mathrm{KL}} / n_{\mathrm{opt}}$ computed over ten independent molecular dynamics (MD) simulations. Grey circles denote the ratio evaluated at the maximum-relevance point $n_{\mathrm{opt}}^{\mathrm{MR}}$, while black squares correspond to the information-theoretic optimum $n_{\mathrm{opt}}^{\mathrm{IT}}$, defined as the $-1$-slope point of the Relevance--Resolution curve. The horizontal red dashed line marks the reference value $n_{\mathrm{KL}} / n_{\mathrm{opt}} = 1$, corresponding to perfect agreement between the Kullback--Leibler-optimal discretization and the Relevance--Resolution prediction. (b,c) Two-dimensional projections in the space of backbone dihedral angles $(\phi, \psi)$ for a representative trajectory. In panel (b), colours represent a two-dimensional histogram of the raw MD frames, providing an empirical estimate of the reference probability distribution. In panel (c), only the centroids of the clusters obtained by RMSD-based clustering at $n_{\mathrm{opt}}^{\mathrm{IT}}$ are shown. Colours are assigned from a two-dimensional histogram in $(\phi,\psi)$ constructed over these centroids, each weighted by the multiplicity of its corresponding cluster.
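
The construction described for Figure 4(b,c) can be sketched as follows: a histogram of the raw frames in $(\phi,\psi)$ serves as the reference distribution, and a histogram of the cluster centroids, each weighted by its cluster multiplicity, serves as the coarse representation. Everything specific here is an assumption made for illustration: the function names, the 36-bin grid over $[-180, 180]$ degrees, the direction of the KL divergence (reference relative to coarse), and the small `eps` floor used for empty bins. The captions do not state how the authors handle any of these.

```python
import math


def weighted_hist2d(points, weights, bins=36, lo=-180.0, hi=180.0):
    """Normalised 2D histogram of (phi, psi) points, as {(i, j): probability}.

    Points are assumed to lie in [lo, hi] on both axes (degrees here).
    """
    width = (hi - lo) / bins
    hist = {}
    for (phi, psi), w in zip(points, weights):
        # Clamp values sitting exactly on the upper edge into the last bin.
        i = min(int((phi - lo) / width), bins - 1)
        j = min(int((psi - lo) / width), bins - 1)
        hist[(i, j)] = hist.get((i, j), 0.0) + w
    total = sum(hist.values())
    return {cell: value / total for cell, value in hist.items()}


def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats over the occupied cells of p.

    eps is an illustrative floor for cells that are empty in q.
    """
    return sum(
        pv * math.log(pv / max(q.get(cell, 0.0), eps)) for cell, pv in p.items()
    )


# Hypothetical toy data: three raw frames and two cluster centroids.
frames = [(-170.0, -170.0), (-170.0, -170.0), (10.0, 10.0)]
centroids = [(-170.0, -170.0), (10.0, 10.0)]
multiplicities = [2, 1]  # number of frames assigned to each centroid

reference = weighted_hist2d(frames, [1.0] * len(frames))
coarse = weighted_hist2d(centroids, multiplicities)
divergence = kl_divergence(reference, coarse)
```

Repeating this for each candidate number of clusters and taking the minimum of `divergence` would give an estimate of $n_{\mathrm{KL}}$ under these assumptions.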