Density Estimation via Measure Transport: Outlook for Applications in the Biological Sciences

Vanessa Lopez-Marrero, Patrick R. Johnstone, Gilchan Park, Xihaier Luo

TL;DR

This work investigates density estimation from limited data via measure transport, focusing on triangular transport maps to unify processing of Gaussian and non-Gaussian distributions. By learning adaptive transport maps on randomized data subsets, the authors reveal dominant dependence structures among genes and demonstrate potential for scientific discovery in radiation biology. The approach enables explicit density evaluation, efficient sampling, and integration of prior biological knowledge (e.g., KEGG pathways) to improve classification and to extract biologically meaningful dependencies. Overall, the framework offers a principled, data-efficient tool for probabilistic modeling and hypothesis generation in complex biological systems with scarce training data.

Abstract

One of several advantages of measure transport methods is that they provide a unified framework for processing and analyzing data distributed according to a wide class of probability measures. Within this context, we present results from computational studies aimed at assessing the potential of measure transport techniques, specifically the use of triangular transport maps, as part of a workflow intended to support research in the biological sciences. Scenarios characterized by the availability of only a limited amount of sample data, which are common in domains such as radiation biology, are of particular interest. We find that when estimating a distribution density function from a limited amount of sample data, adaptive transport maps are advantageous. In particular, statistics gathered from computing a series of adaptive transport maps, each trained on a randomly chosen subset of the available data samples, lead to uncovering information hidden in the data. As a result, in the radiation biology application considered here, this approach provides a tool for generating hypotheses about gene relationships and their dynamics under radiation exposure.

Paper Structure

This paper contains 16 sections, 37 equations, 18 figures, 2 tables, 3 algorithms.

Figures (18)

  • Figure 1: The map $T\!:\! \mathbb{R}^{3} \!\rightarrow\! \mathbb{R}^{3}$ transports samples from $\nu_{\rho}$, a three-variate standard normal (i.e., Gaussian) measure $\mathcal{N}(0,I)$ with density $\rho(x) = (2\pi)^{-3/2}\,e^{-\|x\|^{2}/2}$, to samples from $\nu_{\psi}$, a three-variate measure with multi-modal density $\psi$. The density $\rho$ is the pullback $T^{\sharp}\psi$ of the density $\psi$, and so we have $\rho(x) = T^{\sharp}\psi(x) = \psi(T(x)) \, \det( J_{T}(x))$, where $J_{T}$ is the Jacobian matrix of $T$. Conversely, the map $T^{-1}$ transports samples from $\nu_{\psi}$ to samples from $\nu_{\rho}$. The density $\psi$ is the pullback $(T^{-1})^{\sharp}\rho$ of the density $\rho$, and so $\psi(y) = (T^{-1})^{\sharp}\rho(y) = \rho(T^{-1}(y)) \, \det( J_{T^{-1}}(y))$. Contour plots of the corresponding one- and two-dimensional marginal distribution densities are also shown.
  • Figure 2: Computational workflow for probabilistic modeling, inference, and statistical analysis. To account for the limited number of available data samples, model training can be augmented with prior knowledge. Such prior knowledge may come, for instance, from large language models or human experts. Measure transport techniques provide a unified framework for density estimation and for processing data that exhibit diverse characteristics.
  • Figure 3: Pictorial representation of sparsity patterns of triangular transport maps. To simplify notation in Subfigures \ref{fig:illustration_dense_map}--\ref{fig:illustration_diagonal_map}, we denote $S \equiv T^{-1}$ for the transport map $T^{-1}$ from \ref{eq:triangular_TM_inverse}. In this example, $S\!:\!\mathbb{R}^{5} \!\rightarrow\! \mathbb{R}^{5}$. In each plot, the horizontal axis indexes the random variables $\{\mathcal{Y}_{j}\}_{j=1}^{5}$ and the vertical axis indexes the map components $\{S_{i}\}_{i=1}^{5}$. A square appears at grid point $(j,i)$ if the $i$-th map component $S_{i}$ depends on the value $y_{j}$ of the $j$-th random variable $\mathcal{Y}_{j}$. For any given map component, the set of active variables is the set of random variables on which that component depends. Collectively, the active variables define a sparsity pattern for the transport map.
  • Figure 4: Approximations to univariate density functions via triangular transport maps. In each of the Subfigures \ref{fig:example_1d_gaussian} and \ref{fig:example_1d_non_gaussian}, (i) the red curve is the reference density $\rho$, which is the density of a standard normal (i.e., Gaussian) measure $\mathcal{N}(0,1)$, (ii) a histogram of samples from the target measure $\nu_{\psi}$ is depicted with black lines, (iii) the blue curve is the transport map (TM) approximation to the target density $\psi$, and (iv) the cyan curve is the density of the normal (i.e., Gaussian) measure $\mathcal{N}(\mu,\Sigma)$ whose mean and standard deviation are those of the sample data. For Subfigure \ref{fig:example_1d_gaussian}, the data set was created by drawing samples from a normal (i.e., Gaussian) measure $\mathcal{N}(\mu,\Sigma)$ with $\mu \ne 0$ and $\Sigma \ne 1$. As can be seen, the TM approximation to the target density (blue curve) agrees with the true density (cyan curve). The samples for the data set from Subfigure \ref{fig:example_1d_non_gaussian} were drawn from a multi-modal target measure $\nu_{\psi}$. The resulting TM approximation to the target density (blue curve) reflects this and follows the data histogram (black curve) more closely than the density of $\mathcal{N}(\mu,\Sigma)$ (cyan curve) does.
  • Figure 5: Transport map (TM) approximations to class-conditional densities for the data set UCI_banknote_authentication_267. Subfigure \ref{fig:bn_class0_ATM}: Class 0 marginal distribution densities from data (1st subplot, far left) and three different TM approximations (2nd through 4th subplots), with an increasing number of terms in the basis function expansions \ref{eq:f_expansion} for the TM components. Clearly, the data density (1st subplot in Figure \ref{fig:bn_class0_ATM}) is not Gaussian. The first TM approximation (2nd subplot in Figure \ref{fig:bn_class0_ATM}), with only one term in each basis function expansion \ref{eq:f_expansion}, approximates the target density (1st subplot in Figure \ref{fig:bn_class0_ATM}) poorly. The accuracy of the approximations improves (3rd and 4th subplots in Figure \ref{fig:bn_class0_ATM}, compared to the 1st subplot in Figure \ref{fig:bn_class0_ATM}) as the number of terms in the basis function expansions \ref{eq:f_expansion} increases. Subfigure \ref{fig:bn_class1_ATM}: As in Subfigure \ref{fig:bn_class0_ATM}, but for class 1. Subfigure \ref{fig:bn_classification_ATM}: Left: reference probability density, which is a standard multivariate normal for all TM approximations. Remaining subplots: confusion matrices showing improved classification results as the accuracy of the TM density approximations increases. The top-row confusion matrices are from training data and the bottom-row ones are from inference data. Subfigure \ref{fig:bn_classification_matlab}: As the data are not Gaussian, a naive Bayes classifier (left) performs poorly. A support vector machine (middle) and a neural network classifier (right) perform as well as the classification resulting from the most accurate TM approximation (far right, Subfigures \ref{fig:bn_class0_ATM}-\ref{fig:bn_classification_ATM}).
  • ...and 13 more figures
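
The change-of-variables identities stated in the caption of Figure 1 can be checked numerically on a toy example. The sketch below is not the authors' implementation: it assumes a hand-written two-dimensional lower-triangular map with made-up parameters `A`, `B`, `C`, `E` (the paper's maps are learned from data and use basis function expansions), and all function names are hypothetical. It only illustrates how a triangular map gives both explicit density evaluation and sampling.

```python
import math
import random

# Hypothetical parameters of a toy map, chosen only for illustration.
# B > 0 and E > 0 make each component increasing in its last argument.
A, B, C, E = 1.0, 2.0, 0.5, 0.7


def rho(x):
    """Reference density: standard normal N(0, I) on R^2."""
    return math.exp(-0.5 * (x[0] ** 2 + x[1] ** 2)) / (2.0 * math.pi)


def T(x):
    """Lower-triangular map: component i depends only on x_1, ..., x_i."""
    return (A + B * x[0], C * x[0] ** 2 + E * x[1])


def T_inv(y):
    """Inverse map, obtained by solving the components one at a time."""
    x1 = (y[0] - A) / B
    x2 = (y[1] - C * x1 ** 2) / E
    return (x1, x2)


def det_J_T(x):
    """The Jacobian of T is lower triangular, so its determinant is the
    product of the diagonal entries dT_1/dx_1 = B and dT_2/dx_2 = E."""
    return B * E


def det_J_T_inv(y):
    """Determinant of the Jacobian of T^{-1}: the reciprocal of det_J_T."""
    return 1.0 / (B * E)


def psi(y):
    """Target density as the pullback of rho under T^{-1}."""
    return rho(T_inv(y)) * det_J_T_inv(y)


def sample_target(n, seed=0):
    """Draw reference samples and push them through T."""
    rng = random.Random(seed)
    return [T((rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0))) for _ in range(n)]


if __name__ == "__main__":
    x = (0.3, -1.2)
    # Both printed values should agree: rho(x) = psi(T(x)) * det J_T(x).
    print(rho(x), psi(T(x)) * det_J_T(x))
    ys = sample_target(20000)
    # The sample mean of the first coordinate should be close to A.
    print(sum(y[0] for y in ys) / len(ys))
```

Because the map is triangular, its inverse is computed one component at a time and the Jacobian determinant reduces to a product of diagonal partial derivatives, which is what keeps both density evaluation and sampling inexpensive in this toy setting.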