Generative modeling of density regression through tree flows

Zhuoqun Wang; Naoki Awaya; Li Ma

TL;DR

The paper proposes a flow-based generative model tailored to the density regression task on tabular data. It demonstrates the utility of the method's generative ability by training the flow on a publicly available microbiome study and generating synthetic longitudinal microbiome compositional data.

Abstract

A common objective in the analysis of tabular data is to estimate the conditional distribution of a set of "outcome" variables given a set of "covariates", rather than only to produce predictions; this is sometimes called the "density regression" problem. Beyond estimating the conditional distribution, the ability to draw synthetic samples from the learned conditional distribution is also desirable, as it widens the range of applications. We propose a flow-based generative model tailored to the density regression task on tabular data. Our flow applies a sequence of tree-based piecewise-linear transforms to initial uniform noise to generate samples from complex conditional densities of (univariate or multivariate) outcomes given the covariates, and it allows efficient analytical evaluation of the fitted conditional density at any point in the sample space. We introduce a training algorithm that fits the tree-based transforms with a divide-and-conquer strategy, which turns maximum likelihood training of the tree flow into training a collection of binary classifiers, one at each tree split, under cross-entropy loss. We assess our method by out-of-sample likelihood and compare it with a variety of state-of-the-art conditional density learners on a range of simulated and real benchmark tabular datasets. Our method consistently achieves comparable or superior performance at a fraction of the training and sampling budget. Finally, we demonstrate the utility of our method's generative ability by training our flow on a publicly available microbiome study and generating synthetic longitudinal microbiome compositional data.
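The link between a tree split, a binary classifier, and a piecewise-linear transform can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single split of the outcome range [0, 1] at the midpoint, a scalar covariate, and a plain logistic-regression classifier, and all function names are hypothetical. With one split, the conditional density is p(x)/c on the left piece and (1 - p(x))/(1 - c) on the right, so maximizing the likelihood over p(x) is the same as minimizing the cross-entropy of the classifier that predicts which side of the split y falls on.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_split_classifier(xs, ys, split=0.5, lr=0.5, epochs=300):
    """Fit P(y < split | x) = sigmoid(w*x + b) by gradient descent on cross-entropy."""
    labels = [1.0 if y < split else 0.0 for y in ys]
    w, b, n = 0.0, 0.0, len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, t in zip(xs, labels):
            err = sigmoid(w * x + b) - t  # gradient of cross-entropy w.r.t. the logit
            gw += err * x
            gb += err
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


def tree_cdf(y, p_left, split=0.5):
    """Piecewise-linear CDF on [0, 1]: mass p_left on [0, split), the rest on [split, 1]."""
    if y < split:
        return p_left * y / split
    return p_left + (1.0 - p_left) * (y - split) / (1.0 - split)


def tree_cdf_inverse(u, p_left, split=0.5):
    """Map uniform noise u to an outcome y (the generative direction)."""
    if u < p_left:
        return split * u / p_left
    return split + (1.0 - split) * (u - p_left) / (1.0 - p_left)


def density(y, p_left, split=0.5):
    """Analytical conditional density implied by the one-split tree-CDF."""
    return p_left / split if y < split else (1.0 - p_left) / (1.0 - split)


# Synthetic toy data (an assumption for illustration, not data from the paper):
# the chance that y lands in the left half grows with x.
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(500)]
ys = [
    random.uniform(0.0, 0.5) if random.random() < sigmoid(3.0 * x)
    else random.uniform(0.5, 1.0)
    for x in xs
]

w, b = fit_split_classifier(xs, ys)
p_low = sigmoid(w * -0.75 + b)   # fitted P(y < 0.5 | x = -0.75)
p_high = sigmoid(w * 0.75 + b)   # fitted P(y < 0.5 | x = 0.75)

# Draw synthetic outcomes at x = 0.75 by pushing uniform noise through the inverse CDF.
samples = [tree_cdf_inverse(random.random(), p_high) for _ in range(5)]
print(p_low, p_high, samples)
```

A deeper tree would repeat the same step inside each child interval, and a flow would compose several such transforms; this sketch stops at one split to keep the correspondence between classifier output and density visible.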

Paper Structure (25 sections, 30 equations, 8 figures, 9 tables, 2 algorithms)

Introduction
A conditional flow with tree-based transforms
A tree ensemble-based approximation to conditional densities
Fitting a single covariate-dependent tree-CDF through binary classification
Additional technical improvements
Experiments
Real-world tasks with univariate outcomes
Simulation examples for bivariate outcomes
Real-world tasks with multivariate outcomes
Data generation
Conclusion
Tree-CDF and its inverse
Justification of optimization procedure in \ref{sec:single_tree}
Algorithms for training the tree flow and a single tree
Time complexity analysis
...and 10 more sections

Figures (8)

Figure 1: Comparison on UCI benchmark datasets, measured by test-set log-likelihood (mean $\pm$ standard error). Marker color indicates relative performance: blue where our method outperforms the alternative method, red where it underperforms, and black where the two are comparable within the standard error bounds. The results of NGBoost [duan2020ngboost], RoNGBa [ren2019rongba], and TreeFlow [wielopolski2023treeflow] are obtained from their original papers. The results of PGBM [pgbm] are obtained from [wielopolski2023treeflow]. The results of Dropout, LV, MDN, MF, and RNF are obtained from [bayesiannf].
Figure 2: Ground truth conditional density of simulation examples with bivariate outcome
Figure 3: Training time of our method on a single CPU core versus $ndq$ on log-log scale for 9 UCI datasets—boston, concrete, power, yacht, naval, kin8nm, protein, air, and skillcraft. Points are annotated with $(n,d,q)$ values. A linear trend with slope 1 (gray line) indicates $O(ndq)$ complexity.
Figure 4: Principal coordinate analysis (PCoA) of Bray-Curtis similarity of training (upper row) and simulated (lower row) samples. The color of the points indicates the age (in days) of the infant, which is the covariate in this example.
Figure 5: Half Gaussian. From top to bottom, the rows correspond to $x=-0.75,-0.25, 0.25, 0.75$ respectively.
...and 3 more figures
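The reading of Figure 3 (slope 1 on a log-log plot indicating linear scaling in $ndq$) can be checked numerically with a least-squares fit on the logged values. The sketch below is only illustrative: the helper name `loglog_slope` is hypothetical, and the sizes and timings are synthetic placeholders constructed to be exactly linear, not measurements from the paper.

```python
import math


def loglog_slope(sizes, times):
    """Least-squares slope of log(time) against log(size)."""
    lx = [math.log(s) for s in sizes]
    ly = [math.log(t) for t in times]
    mx = sum(lx) / len(lx)
    my = sum(ly) / len(ly)
    num = sum((a - mx) * (c - my) for a, c in zip(lx, ly))
    den = sum((a - mx) ** 2 for a in lx)
    return num / den


# Synthetic placeholders: problem sizes n*d*q and timings that grow linearly with them.
sizes = [1e4, 1e5, 1e6, 1e7]
linear_times = [2e-6 * s for s in sizes]
quadratic_times = [2e-12 * s ** 2 for s in sizes]

slope_linear = loglog_slope(sizes, linear_times)        # close to 1 for O(ndq) growth
slope_quadratic = loglog_slope(sizes, quadratic_times)  # close to 2 for quadratic growth
print(slope_linear, slope_quadratic)
```

A fitted slope near 1 is consistent with cost proportional to $ndq$; a slope near 2 would instead suggest quadratic growth.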
