Fréchet random forests for metric space valued regression with non euclidean predictors
Louis Capitaine, Jérémie Bigot, Rodolphe Thiébaut, Robin Genuer
TL;DR
This work extends tree-based regression to metric-space valued data by introducing Fréchet trees and Fréchet random forests (FRF), in which both predictors and responses may take values in general metric spaces. The core ideas are Voronoi-style splits chosen by Fréchet variance and predictions given by Fréchet means, supported by a consistency result for Fréchet regressograms and a practical FRF framework with out-of-bag (OOB) error and variable-importance measures. The authors evaluate the methods through simulations on problems with longitudinal, image, and mixed inputs, and on a real-world air quality prediction application where FRF outperforms standard random forests. The results show strong predictive performance, robustness to missing data and time shifts, and applicability to heterogeneous data; computational cost and the existence of Fréchet means are acknowledged as areas for future work.
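For readers unfamiliar with the terms used above, the standard definitions can be written as follows. The notation ($\mathcal{Y}$, $d$, $c_l$, $c_r$, $A_l$, $A_r$) is assumed here for illustration and is not taken from the paper; the split rule is only a sketch of what a Voronoi-style split with two representatives looks like.

```latex
% Empirical Frechet mean and Frechet variance of y_1, ..., y_n in a metric space (Y, d)
\[
  \bar{y} \in \operatorname*{arg\,min}_{m \in \mathcal{Y}} \frac{1}{n} \sum_{i=1}^{n} d(y_i, m)^2,
  \qquad
  V = \frac{1}{n} \sum_{i=1}^{n} d(y_i, \bar{y})^2 .
\]
% Voronoi-style split of an input space (X, d_X) by two representatives c_l and c_r
\[
  A_l = \{\, x \in \mathcal{X} : d_{\mathcal{X}}(x, c_l) \le d_{\mathcal{X}}(x, c_r) \,\},
  \qquad
  A_r = \mathcal{X} \setminus A_l .
\]
```

When $\mathcal{Y} = \mathbb{R}$ with $d(a, b) = |a - b|$, $\bar{y}$ is the arithmetic mean and $V$ is the usual (biased) sample variance, which is why these quantities can stand in for the mean and variance used by standard regression trees.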
Abstract
Random forests are a statistical learning method widely used in many areas of scientific research because of their ability to learn complex relationships between input and output variables and their capacity to handle high-dimensional data. However, current random forest approaches are not flexible enough to handle heterogeneous data such as curves, images and shapes. In this paper, we introduce Fréchet trees and Fréchet random forests, which can handle data whose input and output variables take values in general metric spaces. To this end, a new way of splitting the nodes of trees is introduced and the prediction procedures of trees and forests are generalized. The random forest out-of-bag error and variable importance score are then adapted naturally. A consistency theorem for the Fréchet regressogram predictor using data-driven partitions is given and applied to Fréchet purely uniformly random trees. The method is studied through several simulation scenarios on heterogeneous data combining longitudinal, image and scalar data. Finally, a real dataset on air quality is used to illustrate the proposed method in practice.
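To make the splitting and prediction ideas concrete, here is a minimal Python sketch of a single node split that uses only distance functions. It is an illustration under simplifying assumptions, not the authors' implementation: the Fréchet mean is approximated by the best sample point (a medoid), candidate representatives are every pair of observed inputs, and all function names are hypothetical.

```python
from itertools import combinations


def frechet_medoid(points, dist):
    """Assumption: approximate the Frechet mean by the sample point that
    minimizes the sum of squared distances to all points (a medoid)."""
    return min(points, key=lambda m: sum(dist(m, y) ** 2 for y in points))


def frechet_variance(points, dist):
    """Mean squared distance of the points to their (approximate) Frechet mean."""
    m = frechet_medoid(points, dist)
    return sum(dist(m, y) ** 2 for y in points) / len(points)


def voronoi_split(xs, ys, dist_x, dist_y):
    """Try every pair of inputs as representatives, assign each observation
    to the closer one, and keep the split with the smallest weighted
    Frechet variance of the outputs in the two children."""
    n = len(xs)
    best = None
    for a, b in combinations(range(n), 2):
        left = [i for i in range(n)
                if dist_x(xs[i], xs[a]) <= dist_x(xs[i], xs[b])]
        in_left = set(left)
        right = [i for i in range(n) if i not in in_left]
        if not right:  # happens only if xs[a] and xs[b] coincide
            continue
        cost = (len(left) * frechet_variance([ys[i] for i in left], dist_y)
                + len(right) * frechet_variance([ys[i] for i in right], dist_y)) / n
        if best is None or cost < best[0]:
            best = (cost, left, right)
    return best


def abs_dist(u, v):
    """Toy metric: absolute difference on the real line."""
    return abs(u - v)


# Toy data: two well-separated groups of inputs with different outputs.
xs = [0.0, 0.1, 0.2, 5.0, 5.1, 5.2]
ys = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
cost, left, right = voronoi_split(xs, ys, abs_dist, abs_dist)

# Leaf predictions: the (approximate) Frechet mean of the outputs in each child.
pred_left = frechet_medoid([ys[i] for i in left], abs_dist)
pred_right = frechet_medoid([ys[i] for i in right], abs_dist)
```

Because the code touches the data only through `dist_x` and `dist_y`, the same functions apply to curves, images, or shapes once a suitable distance is supplied. The exhaustive search over pairs costs on the order of n² candidate splits per node and is used here only for clarity.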
