Density Estimation via Binless Multidimensional Integration

Matteo Carli, Alex Rodriguez, Alessandro Laio, Aldo Glielmo

TL;DR

Density estimation in high dimensions is challenging because of the curse of dimensionality, which calls for robust, data-efficient methods. The paper introduces Binless Multidimensional Thermodynamic Integration (BMTI), a nonparametric approach that estimates the negative log-density (NLD) by measuring log-density differences between neighbouring points and integrating these differences on an adaptive, manifold-aware neighbourhood graph. BMTI uses a maximum-likelihood formulation that yields a linear system for the log-density values, quantifies uncertainties through the covariance structure of the differences, and offers approximate and regularised variants to handle disconnected graphs. On a range of synthetic and realistic datasets, BMTI is reported to be more accurate and smoother than state-of-the-art estimators at dimensionalities up to at least 20, highlighting its data efficiency and robustness for applications in physics and chemistry where free-energy landscapes are essential.
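
To make the "linear system" in the summary concrete, the following is a sketch of how one can arise, assuming the measured differences are jointly Gaussian around the true differences. The symbols $B$ (the signed edge-incidence matrix of the neighbourhood graph) and $C$ (the covariance matrix of the differences) are notation introduced here for illustration and are not taken from the paper, whose exact likelihood may contain additional terms.

```latex
% Stack all measured differences into a vector \hat{\delta F}, one entry per
% graph edge (i,j), and define (B F)_{(ij)} = F_j - F_i.
\begin{align}
  \mathcal{L}(F)
    &= -\tfrac{1}{2}\,\bigl(B F - \hat{\delta F}\bigr)^{\top} C^{-1}
        \bigl(B F - \hat{\delta F}\bigr) + \text{const}, \\
  \nabla_F \mathcal{L} = 0
    \;&\Longrightarrow\;
    B^{\top} C^{-1} B \, F = B^{\top} C^{-1} \hat{\delta F}.
\end{align}
% The constant vector lies in the null space of B, so F is fixed only up to
% one additive constant per connected component of the graph.
```

Under this assumption the null space of $B$ explains why a disconnected graph needs special treatment: without extra information, the relative offset between two components cannot be determined from differences alone.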

Abstract

We introduce the Binless Multidimensional Thermodynamic Integration (BMTI) method for nonparametric, robust, and data-efficient density estimation. BMTI estimates the logarithm of the density by initially computing log-density differences between neighbouring data points. Subsequently, such differences are integrated, weighted by their associated uncertainties, using a maximum-likelihood formulation. This procedure can be seen as an extension to a multidimensional setting of the thermodynamic integration, a technique developed in statistical physics. The method leverages the manifold hypothesis, estimating quantities within the intrinsic data manifold without defining an explicit coordinate map. It does not rely on any binning or space partitioning, but rather on the construction of a neighbourhood graph based on an adaptive bandwidth selection procedure. BMTI mitigates the limitations commonly associated with traditional nonparametric density estimators, effectively reconstructing smooth profiles even in high-dimensional embedding spaces. The method is tested on a variety of complex synthetic high-dimensional datasets, where it is shown to outperform traditional estimators, and is benchmarked on realistic datasets from the chemical physics literature.
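
The integration step described above can be illustrated with a minimal sketch. It assumes the differences are independent with Gaussian errors (a simplification; the abstract does not specify the error model), so that maximum likelihood reduces to weighted least squares on a graph. The function name `integrate_differences`, its arguments, and the toy graph are hypothetical and are not the authors' implementation.

```python
import numpy as np

def integrate_differences(n_points, edges, delta_f, eps, anchor=0):
    """Reconstruct F (up to a constant) from noisy differences on a graph.

    Minimises sum_(i,j) (F_j - F_i - delta_f_ij)^2 / eps_ij^2, which is the
    maximum-likelihood solution ASSUMING independent Gaussian errors.
    The graph must be connected; F[anchor] is fixed to 0 to remove the
    additive constant that differences cannot determine.
    """
    weights = 1.0 / np.asarray(eps, dtype=float) ** 2
    A = np.zeros((n_points, n_points))  # weighted graph Laplacian
    b = np.zeros(n_points)
    for (i, j), d, w in zip(edges, delta_f, weights):
        A[i, i] += w
        A[j, j] += w
        A[i, j] -= w
        A[j, i] -= w
        b[j] += w * d  # the edge pulls F_j towards F_i + d
        b[i] -= w * d  # and F_i towards F_j - d
    keep = np.arange(n_points) != anchor
    F = np.zeros(n_points)
    F[keep] = np.linalg.solve(A[np.ix_(keep, keep)], b[keep])
    return F

# Toy example: exact differences on a small connected graph.
F_true = np.array([0.0, 1.0, 3.0, 2.0])
edges = [(0, 1), (1, 2), (2, 3), (0, 3), (0, 2)]
delta_f = [F_true[j] - F_true[i] for i, j in edges]
eps = [0.1, 0.2, 0.1, 0.3, 0.5]  # arbitrary illustrative uncertainties
F_hat = integrate_differences(len(F_true), edges, delta_f, eps)
print(F_hat)
```

With noisy differences, edges with small `eps` dominate the solution, which is the sense in which the differences are "weighted by their associated uncertainties". A dense solve is used here only for brevity; a sparse solver would be the natural choice for large graphs.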

Paper Structure

This paper contains 63 sections, 74 equations, 12 figures, and 1 table.

Figures (12)

  • Figure 1: The BMTI method. Panels A to D illustrate the 4 steps, described in Sec. \ref{ssec:BMTI_deltaF}, needed to construct the BMTI log-likelihood: estimation of the intrinsic dimension $d$, selection of the adaptive neighbourhoods and construction of the neighbourhood graph, estimation of the negative log-density (NLD) gradients $\hat{g}_i$, and finally estimation of the NLD differences $\hat{\delta F}$. Panel E illustrates the reconstruction of the NLD from measurements of NLD differences, as described in Sec. \ref{ssec:BMTI_deltaF}. In this illustration, the NLD $\hat{F}_i$ at point $i$ (blue dot) is computed by taking into account the $\hat{\delta F}$ contributions from 4 neighbours (green, orange, red, and yellow dots). Each contribution pushes the value of $\hat{F}_i$ up (upward arrows) or down (downward arrows).
  • Figure 2: Accuracy in the estimation of $\boldsymbol{\hat{\delta F}}$ and its error. Density scatter plots of true vs estimated $\delta F$'s for 6 test datasets. The insets show the distribution of the standardised variables $(\hat{\delta F_{ij}} - \delta F_{ij})/\varepsilon_{ij}$ in blue, and a standard normal PDF in red; the agreement between the two demonstrates the accuracy of the error estimates.
  • Figure 3: BMTI performance on various datasets. Top: scatter plots of estimated vs ground-truth (GT) negative log-densities for BMTI and GKDE on 4 datasets of increasing intrinsic dimensionality. Bottom: running averages of the absolute error of $\hat{F}$ as a function of the GT value of $F$ for BMTI and other baseline methods; the insets show zoomed-out versions when the error is too large to be visualised in a single graph.
  • Figure 4: A: BMTI smoothness and accuracy. $\hat{F}$ along the minimum energy path connecting the two main minima of a 2d Mueller-Brown potential for various methods. The inset depicts the dataset used in the analysis and, as a red curve, the minimum energy path. B: BMTI data-efficiency. Mean absolute error (MAE) of various nonparametric methods as a function of the number of training points for the 6-dimensional dataset. Points in the plot are computed as the mean MAE over 3 different runs. The standard deviations are very small even with a few hundred points, so they are not plotted.
  • Figure 5: Time scaling: single-CPU training times, measured in seconds, as a function of sample size for the 6-dimensional dataset, in the case of uncorrelated $\delta F$'s illustrated in Sec. \ref{sssec:BMTI_support_approx_inverse_C_gCorr} of the SM.
  • ...and 7 more figures
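
The calibration check described in the caption of Figure 2 can be sketched as follows: if the error estimates $\varepsilon_{ij}$ are accurate, the standardised variables $(\hat{\delta F_{ij}} - \delta F_{ij})/\varepsilon_{ij}$ should have roughly zero mean and unit standard deviation. The helper name `standardised_residuals` and the synthetic data below are illustrative assumptions only and do not reproduce any result from the paper.

```python
import numpy as np

def standardised_residuals(estimated, true, eps):
    """Return (estimated - true) / eps element-wise."""
    estimated, true, eps = (np.asarray(a, dtype=float)
                            for a in (estimated, true, eps))
    return (estimated - true) / eps

# Synthetic toy data (NOT from the paper): estimates whose actual noise
# matches the declared uncertainties, i.e. perfectly calibrated errors.
rng = np.random.default_rng(0)
n = 20_000
true_df = rng.normal(size=n)
eps = rng.uniform(0.1, 0.5, size=n)
est_df = true_df + eps * rng.normal(size=n)

z = standardised_residuals(est_df, true_df, eps)
z_mean, z_std = z.mean(), z.std(ddof=1)
print(f"mean = {z_mean:.3f}, std = {z_std:.3f}")
```

A standard deviation well above 1 would indicate underestimated errors, and one well below 1 would indicate overestimated errors; comparing a histogram of `z` with a standard normal PDF gives the visual version of the same test.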