Fréchet random forests for metric space valued regression with non euclidean predictors

Louis Capitaine, Jérémie Bigot, Rodolphe Thiébaut, Robin Genuer

TL;DR

This work extends tree-based regression to data valued in metric spaces by introducing Fréchet trees and Fréchet random forests (FRF), which allow both predictors and responses to lie in general metric spaces. The core ideas are Voronoi-style splits chosen by Fréchet variance and predictions computed as Fréchet means. The paper gives a consistency result for Fréchet regressograms and adapts the random forest out-of-bag (OOB) error and variable-importance measures to this setting. The authors evaluate the method in simulations on longitudinal, image-curve, and mixed-input problems, and in a real-world air quality prediction application where FRF outperforms standard random forests (RF). They report strong predictive performance, robustness to missing data and time shifts, and applicability to heterogeneous data, while noting computational cost and the existence of Fréchet means as areas for future work.
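
For reference, the two quantities named above have standard definitions. The notation here is an assumption made for illustration and is not taken from the paper: $y_1,\dots,y_n$ is a sample in a metric space $(\mathcal{Y}, d)$, $\bar{y}$ is its empirical Fréchet mean and $\hat{V}$ its empirical Fréchet variance.

$$
\bar{y} \in \arg\min_{y \in \mathcal{Y}} \frac{1}{n}\sum_{i=1}^{n} d(y, y_i)^2,
\qquad
\hat{V} = \frac{1}{n}\sum_{i=1}^{n} d(\bar{y}, y_i)^2 .
$$

When $\mathcal{Y}=\mathbb{R}^p$ with the Euclidean distance, $\bar{y}$ is the arithmetic mean and $\hat{V}$ is the usual empirical variance. In a general metric space the minimiser need not exist or be unique, which is the mean-existence caveat.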

Abstract

Random forests are a statistical learning method widely used in many areas of scientific research because of their ability to learn complex relationships between input and output variables and their capacity to handle high-dimensional data. However, current random forest approaches are not flexible enough to handle heterogeneous data such as curves, images and shapes. In this paper, we introduce Fréchet trees and Fréchet random forests, which can handle data whose input and output variables take values in general metric spaces. To this end, a new way of splitting the nodes of trees is introduced and the prediction procedures of trees and forests are generalized. The random forest out-of-bag error and variable importance score are then naturally adapted. A consistency theorem for the Fréchet regressogram predictor using data-driven partitions is given and applied to Fréchet purely uniformly random trees. The method is studied through several simulation scenarios on heterogeneous data combining longitudinal, image and scalar data. Finally, a real dataset on air quality is used to illustrate the proposed method in practice.
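
The node-splitting idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the Fréchet mean is approximated by the sample medoid (the minimiser restricted to observed points), candidate splits are generated by trying every pair of observed inputs as representatives, and the function names (`frechet_medoid`, `frechet_variance`, `voronoi_split`, `best_split`) are hypothetical. Only precomputed pairwise distance matrices are needed, so the same code applies to any metric.

```python
from itertools import combinations

import numpy as np


def frechet_medoid(D, idx):
    """Index in idx minimising the sum of squared distances to the points in idx.

    Assumption: the Fréchet mean is searched among observed points only.
    """
    sub = D[np.ix_(idx, idx)]
    return idx[int(np.argmin((sub ** 2).sum(axis=1)))]


def frechet_variance(D, idx):
    """Mean squared distance from the points in idx to their medoid."""
    if len(idx) == 0:
        return 0.0
    m = frechet_medoid(D, idx)
    return float((D[idx, m] ** 2).mean())


def voronoi_split(Dx, idx, c_left, c_right):
    """Send each point to the closer of two representatives (ties go left)."""
    go_left = Dx[idx, c_left] <= Dx[idx, c_right]
    return idx[go_left], idx[~go_left]


def best_split(Dx, Dy, idx):
    """Exhaustive search (an illustrative choice) over pairs of representatives,
    keeping the split with the largest decrease in output Fréchet variance."""
    parent = frechet_variance(Dy, idx)
    best_gain, best_left, best_right = -np.inf, None, None
    for a, b in combinations(idx, 2):
        left, right = voronoi_split(Dx, idx, a, b)
        if len(left) == 0 or len(right) == 0:
            continue  # degenerate split, e.g. duplicated inputs
        gain = (parent
                - len(left) / len(idx) * frechet_variance(Dy, left)
                - len(right) / len(idx) * frechet_variance(Dy, right))
        if gain > best_gain:
            best_gain, best_left, best_right = gain, left, right
    return best_gain, best_left, best_right


# Toy data: scalar inputs and outputs with the absolute difference as metric.
x = np.array([0.0, 1.0, 2.0, 10.0, 11.0, 12.0])
y = np.array([1.0, 1.0, 1.0, 3.0, 3.0, 3.0])
Dx = np.abs(x[:, None] - x[None, :])  # input distance matrix
Dy = np.abs(y[:, None] - y[None, :])  # output distance matrix
gain, left, right = best_split(Dx, Dy, np.arange(len(x)))
print(gain, left, right)
```

On this toy data the selected split separates the first three observations from the last three, since both children then have zero output variance. Replacing `Dx` or `Dy` with distance matrices between curves or images leaves the rest of the code unchanged.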

Paper Structure

This paper contains 39 sections, 4 theorems, 82 equations, 13 figures, 1 table.

Key Result

Lemma 1

Let $\mathcal{A}$ be any collection of partitions of $\mathbb{R}^p$. For every $n\geq 1$ and every $\epsilon>0$,

Figures (13)

  • Figure 1: Dynamics of $n=100$ simulated input trajectories according to model \ref{simX}.
  • Figure 2: The first two rows show the time behavior functions for schemes \ref{simX} and \ref{simX3}, for the first two input variables $X^{(1)}$ and $X^{(2)}$ and the output $Y$. The third row shows 50 simulated dynamics according to scheme \ref{simX3} (see Appendix \ref{Complement3}).
  • Figure 3: Boxplots of the prediction error of the Fréchet random forest method according to the mtry parameter. Prediction errors are calculated on 100 datasets of size $n=100$ simulated according to models \ref{simX} and \ref{simY} of the first scenario.
  • Figure 4: Boxplots of the prediction error (MSE) of the Linear mixed effects model (LMEM), CART tree, random forests (RF), FDboost, Fréchet tree (Ftree) and Fréchet random forest (FRF) methods estimated on 100 datasets simulated according to the simulation scheme of the first scenario for $n=100,\ 200,\ 400$ and $1000$ sample sizes.
  • Figure 5: Boxplots of the prediction error (MSE) and computation times estimated over 100 datasets of sample size $n=100$ simulated under models \ref{simX} and \ref{simY} for the Fréchet RF (FRF) and Extremely Randomized Fréchet RF (ERFRF) methods with different values of ntry.
  • ...and 8 more figures

Theorems & Definitions (9)

  • Lemma 1
  • Definition 1: Doubling dimension
  • Definition 2: Covering numbers
  • Lemma 2
  • Theorem 1
  • proof
  • Definition 3: Fréchet purely uniformly random tree
  • Corollary 1
  • proof