Nearest-Neighbours Estimators for Conditional Mutual Information

Jake Witter, Conor Houghton

TL;DR

The paper tackles the data-hungry nature of conditional mutual information estimation by introducing a metric-space, Kozachenko-Leonenko–style nearest-neighbour estimator for $I(X,Y|Z)$ that relies on local counts of points in balls rather than coordinate-based densities. A bias-correction term $I_b(h)$ is derived via a hypergeometric model, and the estimator is optimized over the smoothing parameter $h$ to balance bias and variance. Compared to the KSG estimator, the new method is coordinate-free and provides a practical bias-correction framework. It is demonstrated through simulations on a simple Markov tree and on an XY model used to study transfer entropy, where it gives estimates closer to the ground truth with far less data. The approach broadens applicability to high-dimensional or non-Euclidean data, offering a model-free tool for information-theoretic analysis and causal inference in data science and beyond.

Abstract

The conditional mutual information quantifies the conditional dependence of two random variables. It has numerous applications; it forms, for example, part of the definition of transfer entropy, a common measure of the causal relationship between time series. It does, however, require a lot of data to estimate accurately and suffers from the curse of dimensionality, limiting its application in machine learning and data science. The Kozachenko-Leonenko approach can address this problem: it is possible, in this approach, to define a nearest-neighbour estimator which depends only on the distance between data points and not on the dimension of the data. Furthermore, the bias can be calculated analytically for this estimator. Here, this estimator is described and tested on simulated data.
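
For reference, the link between transfer entropy and conditional mutual information mentioned above can be written out in one common convention. The time index $t$, the history lengths $k$ and $l$, and the history vectors $X_t^{(k)}$ and $Y_t^{(l)}$ are notation assumed here for illustration, not taken from the paper:

$$
T_{Y \to X} = I\left(X_{t+1}, Y_t^{(l)} \,\middle|\, X_t^{(k)}\right), \qquad X_t^{(k)} = (X_t, X_{t-1}, \dots, X_{t-k+1}), \quad Y_t^{(l)} = (Y_t, Y_{t-1}, \dots, Y_{t-l+1}).
$$

Under this convention, estimating transfer entropy amounts to estimating $I(X,Y|Z)$ with the next value of one series, the past of the other series, and the series' own past in the three roles.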

Paper Structure (12 sections, 29 equations, 5 figures)

Figures (5)

  • Figure 1: An illustration of the different intersections. In this cartoon the three sets of outcomes, $\mathcal{X}$, $\mathcal{Y}$ and $\mathcal{Z}$, are marked $X$, $Y$ and $Z$; here these are one-dimensional whereas, of course, they can be high-dimensional or even spaces without a dimension. In A the three balls around the seed point, $B_{X}$, $B_{Y}$ and $B_{Z}$, are drawn in blue, yellow and red. Each of these balls contains $h$ points; this determines its radius. In B attention is restricted to $B_{Z}$; its intersection with $B_{X}$ gives $B_{XZ}$ and its intersection with $B_{Y}$ gives $B_{YZ}$. All three intersect to give the cyan region $B_{XYZ}$. C shows $B_{Z}$ again, from above; the points are also marked, with a star for the seed point. Here, $h_{XYZ}=4$, $h_{XZ}=6$, $h_{YZ}=9$ and all the points lie in $B_{Z}$, so $h$ in this illustration is 22.
  • Figure 2: The Markov network for our simulated data. This illustrates the relationship between the three observed variables $X$, $Y$ and $Z$ and the unobserved variable $W$. $X$ and $Y$ are conditionally independent given $W$; since $Z$ is a noisy version of $W$, the conditional dependence of $X$ and $Y$ given $Z$ depends on the amount of that noise, $\sigma_z$.
  • Figure 3: Results from the one-dimensional Markov tree. In a, the relationship between the estimated conditional mutual information and $\sigma_z$ is shown. The shaded area shows the middle 50% of estimates. Here, the KSG and new methods use 3500 points, while the histogram uses $5 \times 10^6$ points to establish the ground truth. b shows how this estimate scales with the number of data points used in the estimate. Again, the histogram method uses $5 \times 10^6$ points. In c, the relationship between the $k$ used in the KSG estimates and the estimated information is shown. This demonstrates that the choice of $k$ does strongly influence the estimated information. Note the change in scale in the vertical axis between a and the other two rows.
  • Figure 4: Results from the two-dimensional Markov tree. In a, the relationship between the estimated conditional mutual information and $\sigma_z$ is shown. The shaded area shows the middle 50% of estimates. Here, the KSG and new methods use 3500 points, while the histogram uses $5 \times 10^6$ points to establish the ground truth. b shows how this estimate scales with the number of data points used in the estimate. Again, the histogram method uses $5 \times 10^6$ points. In c, the relationship between the $k$ used in the KSG estimates and the estimated information is shown. This demonstrates that the choice of $k$ does strongly influence the estimated information. Note the change in scale in the vertical axis between c and the other two rows.
  • Figure 5: Results for calculating transfer entropy from the XY model. In a, the relationship between the number of samples used and the transfer entropy estimate is shown. Panels show results for past lengths of one (left) and two (right). Here, distance is fixed at one. The dotted line is the histogram estimate after it has converged, here using $5 \times 10^6$ samples. b makes a similar comparison, but for distances of one (left) and two (right), with fixed past length. Again, the dotted line represents the converged histogram method, using a far greater number of samples, $5 \times 10^6$.
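
The counting described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `pairwise_abs` and `ball_counts` are hypothetical, the seed point is assumed to count as one of the $h$ points in its own ball, and ties in distance are broken by index order. The sketch only produces the counts $h_{XZ}$, $h_{YZ}$ and $h_{XYZ}$; it does not reproduce the estimator or the bias correction $I_b(h)$, whose formulas are not given on this page.

```python
import numpy as np


def pairwise_abs(values):
    """Pairwise |a_i - a_j| distances for a 1-D array (illustrative metric only)."""
    values = np.asarray(values, dtype=float)
    return np.abs(values[:, None] - values[None, :])


def ball_counts(dx, dy, dz, seed, h):
    """Count points in the intersections of the h-point balls around `seed`.

    dx, dy, dz are (N, N) pairwise distance matrices in the X, Y and Z spaces.
    Only distances are used, so any metric could be substituted.
    Assumption: the seed itself is included among the h points of each ball.
    """
    def ball(dist):
        # Indices of the h points closest to the seed (stable sort breaks ties by index).
        nearest = np.argsort(dist[seed], kind="stable")[:h]
        return set(nearest.tolist())

    b_x, b_y, b_z = ball(dx), ball(dy), ball(dz)
    h_xz = len(b_x & b_z)          # points in both B_X and B_Z
    h_yz = len(b_y & b_z)          # points in both B_Y and B_Z
    h_xyz = len(b_x & b_y & b_z)   # points in all three balls
    return h_xz, h_yz, h_xyz


# Toy data: X and Z agree on which points are close to the seed, Y mostly does not.
x = [0, 1, 2, 3, 4, 5]
y = [0, 5, 4, 3, 2, 1]
z = [0, 1, 2, 3, 4, 5]
counts = ball_counts(pairwise_abs(x), pairwise_abs(y), pairwise_abs(z), seed=0, h=3)
print(counts)  # (3, 1, 1)
```

Because the function takes distance matrices rather than coordinates, the same counting applies unchanged to data in a space with a metric but no coordinates, which is the property the caption emphasises.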