Wasserstein Gradient Boosting: A Framework for Distribution-Valued Supervised Learning

Takuo Matsubara

TL;DR

This paper presents Wasserstein gradient boosting (WGBoost), a framework for distribution-valued supervised learning that outputs a particle-based estimate of the input-conditioned distribution $p(\theta|x)$. By marrying Wasserstein gradient flows with gradient boosting, WGBoost iteratively updates $N$ ensembles to produce a nonparametric distribution over $\Theta$ for each input, using a KL divergence loss and kernel-smoothed gradients; a diagonal Newton variant (WEvidential) yields a second-order, computationally efficient approach. The default evidential learning setting uses individual-level posteriors $p(\theta|y_i)$ as outputs, enabling predictive distributions via $p(y|x) = \frac{1}{N}\sum_{n} p(y|\theta^n(x))$ and Bayes actions that minimise average risk. Empirically, WGBoost improves probabilistic forecasts and out-of-distribution detection on real-world tabular data compared with established uncertainty quantification methods.
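The particle-averaged predictive distribution $p(y|x) = \frac{1}{N}\sum_{n} p(y|\theta^n(x))$ can be illustrated with a minimal sketch. It assumes a normal response distribution with $\theta = (m, \sigma)$ and treats $\sigma$ as a standard deviation; the particle values and the function names `normal_pdf`, `predictive_density`, and `predictive_mean` are hypothetical and chosen only for illustration. Under squared-error loss the Bayes action is the predictive mean, which for this mixture is the average of the particle locations.

```python
import math

def normal_pdf(y, mean, std):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (y - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def predictive_density(y, particles):
    """Average the response density over the particles [(m, sigma), ...]."""
    return sum(normal_pdf(y, m, s) for m, s in particles) / len(particles)

def predictive_mean(particles):
    """Mean of the equal-weight mixture; the Bayes action under squared-error loss."""
    return sum(m for m, _ in particles) / len(particles)

# Hypothetical output particles (m^n(x), sigma^n(x)) for a single input x.
particles = [(-0.2, 1.0), (0.0, 0.9), (0.3, 1.1)]

density_at_zero = predictive_density(0.0, particles)
point_prediction = predictive_mean(particles)
```

The mixture is itself a proper density, so predictive intervals or other Bayes actions could be computed from it in the same way.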

Abstract

Gradient boosting is a sequential ensemble method that fits a new weak learner to pseudo residuals at each iteration. We propose Wasserstein gradient boosting, a novel extension of gradient boosting that fits a new weak learner to alternative pseudo residuals that are Wasserstein gradients of loss functionals of probability distributions assigned at each input. It solves distribution-valued supervised learning, where the output values of the training dataset are probability distributions for each input. In classification and regression, a model typically returns, for each input, a point estimate of a parameter of a noise distribution specified for a response variable, such as the class probability parameter of a categorical distribution specified for a response label. A main application of Wasserstein gradient boosting in this paper is tree-based evidential learning, which returns a distributional estimate of the response parameter for each input. We empirically demonstrate the superior performance of the probabilistic prediction by Wasserstein gradient boosting in comparison with existing uncertainty quantification methods.
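For orientation, the first sentence of the abstract can be made concrete with a minimal sketch of standard (not Wasserstein) gradient boosting. It assumes a squared-error loss, for which the pseudo residual is simply the current prediction error, and uses a one-split regression stump as the weak learner; `fit_stump` and `gradient_boost` are hypothetical names, and the inputs are assumed to be one-dimensional with at least two distinct values.

```python
def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals by least squares."""
    best = None
    distinct = sorted(set(xs))
    for lo, hi in zip(distinct, distinct[1:]):
        threshold = 0.5 * (lo + hi)
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        error = (sum((r - mean_left) ** 2 for r in left)
                 + sum((r - mean_right) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, mean_left, mean_right)
    _, threshold, mean_left, mean_right = best
    return lambda x: mean_left if x <= threshold else mean_right

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.3):
    """Sequentially fit stumps to the pseudo residuals of the squared-error loss."""
    base = sum(ys) / len(ys)
    learners = []
    predictions = [base] * len(xs)
    for _ in range(n_rounds):
        # Pseudo residual = negative gradient of 0.5 * (y - F(x))^2 with respect to F(x).
        residuals = [y - p for y, p in zip(ys, predictions)]
        learner = fit_stump(xs, residuals)
        learners.append(learner)
        predictions = [p + learning_rate * learner(x) for p, x in zip(predictions, xs)]
    return lambda x: base + learning_rate * sum(f(x) for f in learners)

# Toy step-function data, chosen only for illustration.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
model = gradient_boost(xs, ys)
```

Wasserstein gradient boosting keeps this loop but, as the abstract states, replaces the pseudo residuals with Wasserstein gradients of a loss functional over probability distributions.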

Paper Structure (27 sections, 43 equations, 6 figures, 3 tables, 4 algorithms)

Introduction
General Formulation of Wasserstein Gradient Boosting
Wasserstein Gradient Flow
Gradient Boosting
Wasserstein Gradient Boosting
Default Setting for Tree-Based Evidential Learning
Individual-Level Posteriors as Output Distributions
Choice of Individual-Level Priors
Approximate Wasserstein Gradient of KL Divergence
Second-Order Implementation of WGBoost
Applications with Real-world Tabular Data
Illustrative Conditional Density Estimation
Probabilistic Regression Benchmark
Classification and Out-of-Distribution Detection
Discussion
...and 12 more sections

Figures (6)

Figure 1: Illustration of inputs and outputs of WGBoost trained on a training set $\{ x_i, \mu_i \}_{i=1}^{10}$ whose inputs are 10 grid points in $[-3.5, 3.5]$ and whose output distributions are normal distributions $\mu_i(\theta) = \mathcal{N}(\theta \mid \sin(x_i), 0.5)$ over $\theta \in \mathbb{R}$. The blue area indicates the 95% high-probability region of the output distribution at each point. WGBoost returns $N$ particles (red lines) that predict the output distribution for each input; this illustration uses $N = 10$ and a Gaussian kernel regressor as each weak learner of WGBoost.
Figure 2: Comparison of the pipelines of (a) Bayesian learning and (b) evidential learning based on WGBoost. The former uses the (global) posterior $p(w \mid \{ x_i, y_i \}_{i=1}^{D})$ of the model parameter $w$ conditional on all data, and samples multiple models from it. The latter uses the individual-level posterior $p(\theta \mid y_i)$ of the response parameter $\theta$ conditional on each individual datum $y_i$ as the output distribution in the training set, and trains WGBoost to directly return a particle-based distributional estimate $p(\theta \mid x)$ of $\theta$ for each input $x$.
Figure 3: Conditional density estimation for the bone mineral density dataset (grey dots) by WEvidential, where the normal response distribution $\mathcal{N}(y \mid m, \sigma)$ is specified for the response variable $y$. Left: distributional estimate (10 particles) of the location parameter $\{ m^n(x)\}_{n=1}^{10}$ for each input. Right: estimated density \ref{eq:BMA} based on the normal response distribution averaged over the output particles $\{ ( m^n(x), \sigma^n(x) ) \}_{n=1}^{10}$.
Figure 4: Conditional density estimation for the Old Faithful geyser dataset (grey dots) by WEvidential. Left: distributional estimate (10 particles) of the location parameter for each input. Right: estimated density by the predictive distribution \ref{eq:BMA} based on the output particles.
Figure 5: The approximation error and computational time of the four algorithms. Left: approximation error of each algorithm, measured by the MMD averaged over the inputs. Right: computational time as a function of the number of weak learners, on a common logarithmic scale.
...and 1 more figures
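
The toy setup in the Figure 1 caption can be approximated with a short sketch. This is an illustration under several assumptions, not the paper's implementation: the kernel-smoothed Wasserstein gradient of the KL divergence is taken to be the SVGD-style direction with a fixed-bandwidth Gaussian kernel, the value 0.5 in $\mathcal{N}(\theta \mid \sin(x_i), 0.5)$ is treated as a standard deviation, and the bandwidths, learning rate, number of rounds, initial particle values, and the names `smoothed_direction`, `fit_kernel_regressor`, and `wgboost` are all hypothetical choices.

```python
import math

def smoothed_direction(thetas, score, h=0.5):
    """SVGD-style kernel-smoothed descent direction of KL(. || target) at each particle."""
    directions = []
    for a in thetas:
        total = 0.0
        for b in thetas:
            k = math.exp(-((a - b) ** 2) / (2.0 * h * h))
            # Attraction towards high target density, plus repulsion
            # that keeps the particles from collapsing onto one point.
            total += k * score(b) + k * (a - b) / (h * h)
        directions.append(total / len(thetas))
    return directions

def fit_kernel_regressor(xs, targets, width=0.5):
    """Gaussian kernel (Nadaraya-Watson) regressor used as the weak learner."""
    def predict(x):
        logits = [-((x - xi) ** 2) / (2.0 * width * width) for xi in xs]
        top = max(logits)  # subtract the max so the weights never all underflow
        weights = [math.exp(l - top) for l in logits]
        return sum(w * t for w, t in zip(weights, targets)) / sum(weights)
    return predict

def wgboost(xs, centres, std=0.5, n_particles=10, n_rounds=150, lr=0.1):
    """Boost one ensemble per particle towards targets N(centres[i], std^2)."""
    init = [-1.0 + 2.0 * n / (n_particles - 1) for n in range(n_particles)]
    learners = [[] for _ in range(n_particles)]
    current = [[init[n]] * len(xs) for n in range(n_particles)]  # current[n][i]
    for _ in range(n_rounds):
        grads = [[0.0] * len(xs) for _ in range(n_particles)]
        for i, c in enumerate(centres):
            thetas = [current[n][i] for n in range(n_particles)]
            # Score (gradient of the log density) of the normal target at input i.
            direction = smoothed_direction(thetas, lambda t, c=c: (c - t) / std ** 2)
            for n in range(n_particles):
                grads[n][i] = direction[n]
        for n in range(n_particles):
            # Fit the n-th weak learner to the n-th particle's directions.
            learner = fit_kernel_regressor(xs, grads[n])
            learners[n].append(learner)
            for i, x in enumerate(xs):
                current[n][i] += lr * learner(x)
    def predict(x):
        return [init[n] + lr * sum(f(x) for f in learners[n])
                for n in range(n_particles)]
    return predict

# Ten grid points in [-3.5, 3.5] with target locations sin(x_i), as in the caption.
xs = [-3.5 + 7.0 * i / 9 for i in range(10)]
centres = [math.sin(x) for x in xs]
predict = wgboost(xs, centres)
particles_at_x6 = predict(xs[6])
```

Calling `predict(x)` returns the $N$ particle values at a new input, so evaluating it over a grid would trace out curves analogous to the red lines in the figure.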

Theorems & Definitions (6)

Remark 1: Stochastic WGBoost
Remark 2: Second-Order WGBoost
Remark 3: Difference from Bayesian Learning
Example 1: Normal Location-Scale
Example 2: Categorical
Remark 4: Computation
