Generative modeling of density regression through tree flows

Zhuoqun Wang, Naoki Awaya, Li Ma

TL;DR

A flow-based generative model tailored for density regression on tabular data is proposed. Its generative ability is demonstrated by training the flow on a publicly available microbiome study and generating synthetic longitudinal microbiome compositional data.

Abstract

A common objective in the analysis of tabular data is estimating the conditional distribution (in contrast to only producing predictions) of a set of "outcome" variables given a set of "covariates", which is sometimes referred to as the "density regression" problem. Beyond estimation of the conditional distribution, the generative ability to draw synthetic samples from the learned conditional distribution is also desired, as it further widens the range of applications. We propose a flow-based generative model tailored for the density regression task on tabular data. Our flow applies a sequence of tree-based piecewise-linear transforms to initial uniform noise to eventually generate samples from complex conditional densities of (univariate or multivariate) outcomes given the covariates, and it allows efficient analytical evaluation of the fitted conditional density at any point in the sample space. We introduce a training algorithm for fitting the tree-based transforms using a divide-and-conquer strategy that transforms maximum likelihood training of the tree-flow into training a collection of binary classifiers (one at each tree split) under cross-entropy loss. We assess the performance of our method under out-of-sample likelihood evaluation and compare it with a variety of state-of-the-art conditional density learners on a range of simulated and real benchmark tabular datasets. Our method consistently achieves comparable or superior performance at a fraction of the training and sampling budget. Finally, we demonstrate the utility of our method's generative ability through an application to generating synthetic longitudinal microbiome compositional data based on training our flow on a publicly available microbiome study.
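The idea of turning likelihood training into one binary classifier per tree split can be illustrated with a deliberately simplified sketch. It is not the authors' implementation: it assumes a single tree rather than a sequence of transforms, a scalar outcome on [0, 1), a scalar covariate, fixed dyadic splits, and plain logistic regression as the classifier. All function names (`fit_logistic`, `fit_tree`, `log_density`, `sample`) are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(x, labels, steps=500, lr=0.5):
    """Fit P(label=1 | x) = sigmoid(w*x + b) by gradient descent on cross-entropy."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        p = sigmoid(w * x + b)
        w -= lr * np.mean((p - labels) * x)
        b -= lr * np.mean(p - labels)
    return w, b

def fit_tree(x, y, depth, lo=0.0, hi=1.0):
    """Divide and conquer: one 'go left' classifier per dyadic split of [lo, hi)."""
    if depth == 0:
        return None
    mid = 0.5 * (lo + hi)
    left = y < mid
    # Assumption: an empty node falls back to an even split (w = b = 0).
    w, b = fit_logistic(x, left.astype(float)) if len(y) > 0 else (0.0, 0.0)
    return {"w": w, "b": b,
            "left": fit_tree(x[left], y[left], depth - 1, lo, mid),
            "right": fit_tree(x[~left], y[~left], depth - 1, mid, hi)}

def log_density(tree, x, y):
    """Analytic log conditional density: sum of log split probabilities minus log leaf width."""
    lo, hi, logp, node = 0.0, 1.0, 0.0, tree
    while node is not None:
        mid = 0.5 * (lo + hi)
        p = sigmoid(node["w"] * x + node["b"])
        if y < mid:
            logp, hi, node = logp + np.log(p), mid, node["left"]
        else:
            logp, lo, node = logp + np.log(1.0 - p), mid, node["right"]
    return logp - np.log(hi - lo)

def sample(tree, x, u):
    """Piecewise-linear map from uniform noise u in [0, 1) to an outcome given x."""
    lo, hi, node = 0.0, 1.0, tree
    while node is not None:
        mid = 0.5 * (lo + hi)
        p = sigmoid(node["w"] * x + node["b"])
        if u < p:
            u, hi, node = u / p, mid, node["left"]
        else:
            u, lo, node = (u - p) / (1.0 - p), mid, node["right"]
    return lo + u * (hi - lo)

# Toy data (made up for illustration): the outcome shifts upward with the covariate.
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=2000)
y_train = np.clip(0.5 + 0.3 * x_train + 0.1 * rng.normal(size=2000), 0.0, 1.0 - 1e-9)
tree = fit_tree(x_train, y_train, depth=3)

noise = rng.uniform(size=500)
draws_hi = np.array([sample(tree, 0.8, u) for u in noise])
draws_lo = np.array([sample(tree, -0.8, u) for u in noise])
```

Because the log density is a sum of per-split log probabilities plus a constant, maximizing the likelihood of this toy tree is equivalent to minimizing cross-entropy separately at each split, which is the decomposition the abstract describes. The same tree serves both for sampling (`sample`) and for exact density evaluation (`log_density`).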

Paper Structure (25 sections, 30 equations, 8 figures, 9 tables, 2 algorithms)

Figures (8)

  • Figure 1: Comparison on UCI benchmark datasets as measured by log-likelihood of test set (mean $\pm$ standard error). Marker color indicates relative performance: blue indicates that our method outperforms the alternative method, red indicates that our method underperforms, and black denotes comparable performance within the standard error bounds. The results of NGBoost (duan2020ngboost), RoNGBa (ren2019rongba), and TreeFlow (wielopolski2023treeflow) are obtained from their original papers. The results of PGBM (pgbm) are obtained from wielopolski2023treeflow. The results of Dropout, LV, MDN, MF, and RNF are obtained from bayesiannf.
  • Figure 2: Ground truth conditional density of simulation examples with bivariate outcome
  • Figure 3: Training time of our method on a single CPU core versus $ndq$ on log-log scale for 9 UCI datasets—boston, concrete, power, yacht, naval, kin8nm, protein, air, and skillcraft. Points are annotated with $(n,d,q)$ values. A linear trend with slope 1 (gray line) indicates $O(ndq)$ complexity.
  • Figure 4: Principal coordinate analysis (PCoA) of Bray-Curtis similarity of training (upper row) and simulated (lower row) samples. The color of the points indicates the age (in days) of the infant, which is the covariate in this example.
  • Figure 5: Half Gaussian. From top to bottom, the rows correspond to $x=-0.75,-0.25, 0.25, 0.75$ respectively.
  • ...and 3 more figures