Two-sample comparison through additive tree models for density ratios

Naoki Awaya, Yuliang Xu, Li Ma

TL;DR

This work proposes additive tree models for density ratio estimation along with efficient algorithms using a new loss function, the balancing loss, which allows tree-based models to be trained using several algorithms originally designed for supervised learning, such as forward-stagewise optimization and gradient boosting.

Abstract

The ratio of two densities provides a direct characterization of their differences. We consider the two-sample comparison problem by estimating this ratio given i.i.d. observations from two distributions. To this end, we propose additive tree models for density ratio estimation along with efficient algorithms using a new loss function, the balancing loss. The loss allows tree-based models to be trained using several algorithms originally designed for supervised learning, such as forward-stagewise optimization and gradient boosting. Moreover, the balancing loss resembles an exponential family kernel, and it can serve as a pseudo-likelihood with conjugate priors. This property enables generalized Bayesian inference on the density ratio using backfitting samplers designed for Bayesian additive regression trees (BART). Our Bayesian strategy provides uncertainty quantification for the inferred density ratio, which is critical for applications involving high-dimensional and data-limited distributions with potentially substantial uncertainty. We further show connections of the balancing loss to the exponential loss in binary classification and to the variational form of f-divergence, particularly the squared Hellinger distance. Numerical experiments demonstrate that our method achieves both accuracy and computational efficiency, while uniquely providing uncertainty quantification. Finally, we demonstrate its application to assessing the quality of generative models for microbiome compositional data.
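The link between the balancing loss, the exponential loss, and the squared Hellinger distance can be checked numerically on a toy problem. The sketch below is illustrative only: this summary does not state the loss explicitly, so the code assumes the form $\mathbb{E}_P[1/w] + \mathbb{E}_Q[w]$, which is the exponential loss with $\log w$ as the score and is minimized pointwise at $w = \sqrt{p/q}$, where it equals twice the Bhattacharyya coefficient. The Gaussian setup and the names `balancing_loss`, `log_sqrt_ratio`, and `scales` are hypothetical choices made for this example, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative setup (not from the paper): P = N(1, 1), Q = N(0, 1).
mu_p, mu_q, n = 1.0, 0.0, 50_000
x_p = rng.normal(mu_p, 1.0, size=n)  # sample from P
x_q = rng.normal(mu_q, 1.0, size=n)  # sample from Q


def log_sqrt_ratio(x):
    """log sqrt(p(x)/q(x)) for the two unit-variance Gaussians above."""
    return 0.25 * ((x - mu_q) ** 2 - (x - mu_p) ** 2)


def balancing_loss(log_w, x_p, x_q):
    """ASSUMED form: mean_P[1/w] + mean_Q[w], with w = exp(log_w(x))."""
    return np.mean(np.exp(-log_w(x_p))) + np.mean(np.exp(log_w(x_q)))


# Evaluate the loss along the one-parameter family log w = c * log sqrt(p/q).
scales = [0.0, 0.5, 1.0, 1.5]
losses = {c: balancing_loss(lambda x, c=c: c * log_sqrt_ratio(x), x_p, x_q)
          for c in scales}

# Closed-form Bhattacharyya coefficient for two unit-variance Gaussians.
bc_true = np.exp(-((mu_p - mu_q) ** 2) / 8.0)
bc_est = losses[1.0] / 2.0
hellinger_sq_est = 1.0 - bc_est  # convention used here: H^2 = 1 - BC

for c in scales:
    print(f"c = {c:.1f}  loss = {losses[c]:.4f}")
print(f"2 * BC (closed form) = {2 * bc_true:.4f}")
print(f"estimated squared Hellinger distance = {hellinger_sq_est:.4f}")
```

Under these assumptions, the loss at `c = 1.0` should be the smallest of the four values and should sit close to `2 * bc_true`, up to Monte Carlo error. The constant candidate `c = 0.0` (that is, $w \equiv 1$) gives a loss of exactly 2, the value attained when the two distributions are treated as identical.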

Paper Structure

This paper contains 32 sections, 2 theorems, 41 equations, 14 figures, 3 tables.

Key Result

Proposition 1

Suppose $w_{k-1}$ is the estimate of $\sqrt{r^*} = \sqrt{p/q}$ after $(k-1)$ steps of fitting. Then in the $k$th step, the loss is minimized by $\log w_{k}=\log w_{k-1}+f_k$ when the following conditions are satisfied by $f_k$:

Figures (14)

  • Figure 1: An evaluation of posterior distributions under the 1D scenarios with the balanced/unbalanced sample size. Each left plot compares the true log-densities (gray, dotted) and the posteriors (mean: black, solid, 95% interval: blue, dashed) evaluated pointwise. Each middle plot shows the posterior distributions of the inverse temperature $\tau^{-1}$ and the true Bhattacharyya coefficient. Each right plot visualizes the calibration plot comparing the posterior uncertainty and the nominal coverage rate (namely, the ratio of observations included in the pointwise credible intervals) based on 50 simulations and their averages.
  • Figure 2: A representative sample of simulations generated from the three scenarios used for the two-dimensional experiments. The sample sizes, $n_0$ and $n_1$, are both set to 5000.
  • Figure 3: The estimated log-density ratios obtained from the algorithms considered in the two-dimensional examples for the local shift scenario ($n_0 = n_1 = 5000$). For the Bayesian additive model, the 2.5% quantiles and 97.5% quantiles of the log-density ratios evaluated pointwise are also displayed. The difference between FS and GB is minimal, so we present only the results for GB.
  • Figure 4: A comparison of the true log-density ratios and the estimated ratios projected onto the space of the latent variables $(z_1, z_2)$, for the location shift scenario ($n_0 = n_1 = 5000$). For the Bayesian additive model, the 2.5% quantiles and 97.5% quantiles of the log-density ratios evaluated pointwise are also displayed. The difference in the results for FS and GB is minimal, so we present only the results for GB.
  • Figure 5: Principal Coordinates Analysis (PCoA) plots based on Bray-Curtis distance between observations. The blue and orange points represent the real sample and the generated sample, respectively. The first row presents the training sample vs. the generated sample, and the second row presents the test sample vs. the generated sample.
  • ...and 9 more figures

Theorems & Definitions (2)

  • Proposition 1
  • Corollary 1