Generative Forests

Richard Nock, Mathieu Guillame-Bert

TL;DR

This paper introduces a powerful new class of forest-based models for tabular data, together with a simple training algorithm whose strong convergence guarantees hold in a boosting model that parallels the original weak / strong supervised learning setting.

Abstract

We focus on generative AI for a type of data that still represents one of the most prevalent forms of data: tabular data. Our paper introduces two key contributions: a new powerful class of forest-based models fit for such tasks, and a simple training algorithm with strong convergence guarantees in a boosting model that parallels that of the original weak / strong supervised learning setting. This algorithm can be implemented with a few tweaks to the most popular scheme for decision tree induction (i.e. supervised learning) with two classes. Experiments on the quality of generated data display substantial improvements compared to the state of the art. The losses our algorithm minimizes and the structure of our models make them practical for related tasks that require fast estimation of a density given a generative model and an observation (even partially specified): such tasks include missing data imputation and density estimation. Additional experiments on these tasks reveal that our models can be notably good contenders to diverse state-of-the-art methods, relying on models as diverse as (or mixing elements of) trees, neural nets, kernels or graphical models.

Paper Structure

This paper contains 5 sections, 2 theorems, 5 equations, 4 figures, 2 tables, 3 algorithms.

Key Result

Lemma 4.3

In StarUpdate, it always holds that the input $\mathcal{C}$ satisfies $\mathcal{C} \subseteq \mathcal{X}_{\Upsilon.\nu^\star}$.
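
The invariant in Lemma 4.3 can be illustrated with a small sketch of generation. This is not the authors' implementation: the `Node` class, the helper names, the round-robin order of updates, and the branching rule (go to a child with probability equal to the share of training points of the current $\mathcal{C}$ that fall in that child's support) are all assumptions made for illustration. Each tree keeps a star node, starting at its root; every update moves one star node to a child and shrinks $\mathcal{C}$ accordingly, and an observation is sampled uniformly in $\mathcal{C}$ once all star nodes are leaves.

```python
import random

class Node:
    """Hypothetical tree node; `box` is its support, a list of (low, high) intervals."""
    def __init__(self, box, left=None, right=None):
        self.box, self.left, self.right = box, left, right

    def is_leaf(self):
        return self.left is None

def intersect(a, b):
    return [(max(al, bl), min(ah, bh)) for (al, ah), (bl, bh) in zip(a, b)]

def is_subset(a, b):
    return all(bl <= al and ah <= bh for (al, ah), (bl, bh) in zip(a, b))

def mass(box, data):
    # Number of training points falling in the (half-open) box.
    return sum(all(lo <= v < hi for v, (lo, hi) in zip(x, box)) for x in data)

def star_update(star, C, data, rng):
    # Invariant stated in Lemma 4.3: C is contained in the star node's support.
    assert is_subset(C, star.box)
    left_C = intersect(C, star.left.box)
    right_C = intersect(C, star.right.box)
    m_left, m_right = mass(left_C, data), mass(right_C, data)
    if m_left + m_right == 0:
        raise ValueError("C contains no training point")
    # Assumed branching rule: empirical share of C that lies in the left child.
    if rng.random() < m_left / (m_left + m_right):
        return star.left, left_C
    return star.right, right_C

def generate(roots, domain, data, rng):
    stars, C = list(roots), list(domain)  # Init: star nodes are roots, C is the domain
    while any(not s.is_leaf() for s in stars):
        for t, s in enumerate(stars):
            if not s.is_leaf():
                stars[t], C = star_update(s, C, data, rng)
    # All star nodes are leaves: sample uniformly in C.
    return [rng.uniform(lo, hi) for lo, hi in C], C

# Toy forest on [0, 1)^2: tree 1 splits the first feature at 0.5, tree 2 the second.
domain = [(0.0, 1.0), (0.0, 1.0)]
tree1 = Node(domain, Node([(0.0, 0.5), (0.0, 1.0)]), Node([(0.5, 1.0), (0.0, 1.0)]))
tree2 = Node(domain, Node([(0.0, 1.0), (0.0, 0.5)]), Node([(0.0, 1.0), (0.5, 1.0)]))
data = [(0.1, 0.2), (0.3, 0.4)]  # all training points lie in the lower-left quadrant
sample, final_C = generate([tree1, tree2], domain, data, random.Random(0))
```

In this toy setting every training point lies in the lower-left quadrant, so both star nodes move to their "low" leaf and the observation is drawn from the intersection of the two leaves' supports.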

Figures (4)

  • Figure 1: Sketch comparing two approaches to generating one observation: Adversarial Random Forests (left) and generative forests, gf (right, this paper). With Adversarial Random Forests, a tree is sampled uniformly at random, then a leaf is sampled in that tree, and finally an observation is sampled according to the distribution "attached" to the leaf. Hence, only one tree is used to generate an observation. In our case, we leverage the combinatorial power of the trees in the forest: all trees are used to generate one observation, as each contributes one leaf. Figure 3 provides more details on generation using gf.
  • Figure 2: A gf ($T=2$) associated to UCI German Credit. Constraint (C) (see text) implies that the domain of "Number existing credits" is $\{0, 1, \ldots, 8\}$, that of "Job" is $\{$A171, A172, A173, A174$\}$, etc.
  • Figure 3: From left to right and top to bottom: updates of the argument $\mathcal{C}$ of StarUpdate through a sequence of runs of StarUpdate in a generative forest consisting of three trees (the partition of the domain induced by each tree is also depicted, alongside the nature of the split, vertical or horizontal, at each internal node), whose star nodes are indicated with chess pieces. In each picture, $\mathcal{C}$ is represented at the bottom of the picture (hence, $\mathcal{C} = \mathcal{X}$ after Init). In the bottom-right picture, all star nodes are leaves and thus $\mathcal{C}_{\mathrm{s}}=\mathcal{C}$ displays the portion of the domain in which an observation is sampled. Note that the last star node update produced no change in $\mathcal{C}$.
  • Figure 4: (Top row) Density estimation using a gf, on an observation indicated by ${\color{darkgreen} \bullet}$ (Left). In each tree, the leaf reached by the observation is found and the intersection of all leaves' supports is computed. The estimated density at ${\color{darkgreen} \bullet}$ is computed as the empirical measure in this intersection over its volume (Right). (Bottom row) Missing data imputation using the same gf, and an observation with one missing value (if $\mathcal{X} \subset \mathbb{R}^2$, then $y$ is missing). We first proceed like in density estimation, finding in each tree all leaves potentially reached by the observation if $y$ were known (Left); then, we compute the density in each non-empty intersection of all leaves' supports; among the corresponding elements with maximal density, we get the missing value(s) by uniform sampling (Right).
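
The density-estimation procedure described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: each tree is represented only by the list of its leaves, each leaf is a half-open axis-aligned box `(lower, upper)`, and the names `leaf_containing` and `estimate_density` are hypothetical. For each tree the leaf reached by the observation is found, the leaves' supports are intersected, and the estimate is the fraction of training points in the intersection divided by its volume.

```python
import numpy as np

def leaf_containing(leaves, x):
    """Return the (lower, upper) box of the leaf reached by observation x."""
    for lower, upper in leaves:
        if np.all(lower <= x) and np.all(x < upper):
            return lower, upper
    raise ValueError("observation falls outside the tree's domain")

def estimate_density(forest, data, x):
    """Empirical measure of the intersection of reached leaves, over its volume."""
    lower = np.full(x.shape, -np.inf)
    upper = np.full(x.shape, np.inf)
    for leaves in forest:
        leaf_lower, leaf_upper = leaf_containing(leaves, x)
        # Intersect the running box with the support of the reached leaf.
        lower = np.maximum(lower, leaf_lower)
        upper = np.minimum(upper, leaf_upper)
    inside = np.all((data >= lower) & (data < upper), axis=1)
    volume = np.prod(upper - lower)
    return inside.mean() / volume

# Toy forest on [0, 1)^2: tree 1 splits the first feature at 0.5, tree 2 the second.
tree1 = [(np.array([0.0, 0.0]), np.array([0.5, 1.0])),
         (np.array([0.5, 0.0]), np.array([1.0, 1.0]))]
tree2 = [(np.array([0.0, 0.0]), np.array([1.0, 0.5])),
         (np.array([0.0, 0.5]), np.array([1.0, 1.0]))]
data = np.array([[0.25, 0.25], [0.25, 0.75], [0.75, 0.25], [0.2, 0.2]])

d_low = estimate_density([tree1, tree2], data, np.array([0.1, 0.1]))
d_empty = estimate_density([tree1, tree2], data, np.array([0.9, 0.9]))
```

In this toy example the query `(0.1, 0.1)` lands in the lower-left quadrant, which holds half of the training points in a quarter of the domain's volume, while the upper-right quadrant holds no training point.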

Theorems & Definitions (4)

  • Definition 4.1
  • Definition 4.2
  • Lemma 4.3
  • Lemma 4.4