The Generalized Proximity Forest

Ben Shaw, Adam Rustad, Sofia Pelagalli Maia, Jake S. Rhodes, Kevin R. Moon

TL;DR

The paper addresses the limitation of RF proximities to tabular data by introducing the generalized Proximity Forest (PF), which supports any distance measure and thus extends proximity-based learning to graphs, variable-length time series, and vector-valued data. It also provides a regression variant and a meta-imputation framework that leverages pretrained classifiers as imputers via GAP proximities. Through extensive experiments on outlier detection, missing data imputation, graph classification, and regression, the authors demonstrate that GAP proximities can outperform KNN in imputation tasks and that the generalized PF can match or exceed RF and KNN performance in several domains, with favorable scalability over brute-force nearest-neighbor methods. The approach offers practical impact by enabling model-informed, distance-based learning across diverse data types and by enabling pretrained models to contribute to imputation pipelines without retraining the underlying predictors.

Abstract

Recent work has demonstrated the utility of Random Forest (RF) proximities for various supervised machine learning tasks, including outlier detection, missing data imputation, and visualization. However, the utility of the RF proximities depends upon the success of the RF model, which itself is not the ideal model in all contexts. RF proximities have recently been extended to time series by means of the distance-based Proximity Forest (PF) model, among others, affording time series analysis with the benefits of RF proximities. In this work, we introduce the generalized PF model, thereby extending RF proximities to all contexts in which supervised distance-based machine learning can occur. Additionally, we introduce a variant of the PF model for regression tasks. We also introduce the notion of using the generalized PF model as a meta-learning framework, extending supervised imputation capability to any pre-trained classifier. We experimentally demonstrate the unique advantages of the generalized PF model compared with both the RF model and the $k$-nearest neighbors model.
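To make the phrase "supervised distance-based machine learning" concrete, here is a minimal sketch of one distance-based tree split that accepts any user-supplied distance. It is an illustration under stated assumptions, not the authors' implementation: the helper names `proximity_split` and `dtw_distance` are hypothetical, the rule "one random exemplar per class, route each item to its nearest exemplar" is assumed from the general Proximity Forest idea rather than taken from this text, and a full model would typically score several candidate splits and recurse. Dynamic time warping is used only as an example of a distance that handles variable-length series.

```python
import random
from collections import defaultdict


def dtw_distance(a, b):
    """Plain dynamic time warping between two numeric sequences.

    Works for sequences of different lengths, which is why it serves
    here as an example of a non-tabular distance measure.
    """
    inf = float("inf")
    # prev[j] holds the best alignment cost of the rows seen so far vs. b[:j].
    prev = [0.0] + [inf] * len(b)
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            curr.append(cost + min(prev[j], prev[j - 1], curr[j - 1]))
        prev = curr
    return prev[-1]


def proximity_split(items, labels, distance, rng):
    """One candidate split (hypothetical helper, assumed split rule).

    Draws one random exemplar per class, then assigns every item index
    to the branch of the class whose exemplar is nearest under `distance`.
    """
    by_class = defaultdict(list)
    for item, label in zip(items, labels):
        by_class[label].append(item)

    # Iterate classes in sorted order so a seeded rng gives repeatable draws.
    exemplars = {label: rng.choice(by_class[label]) for label in sorted(by_class)}

    branches = defaultdict(list)
    for idx, item in enumerate(items):
        nearest = min(exemplars, key=lambda label: distance(item, exemplars[label]))
        branches[nearest].append(idx)
    return exemplars, dict(branches)


# Toy usage: four series of unequal length, two classes.
series = [[0, 0, 1], [0, 1], [5, 5, 5, 6], [5, 6]]
labels = ["low", "low", "high", "high"]
exemplars, branches = proximity_split(series, labels, dtw_distance, random.Random(0))
```

Because `distance` is passed in as a plain function, the same split routine could in principle be reused with a graph or vector distance, which is the sense in which a distance-based forest is not tied to tabular features.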

Paper Structure

This paper contains 17 sections, 5 equations, 2 figures.

Figures (2)

  • Figure 1: Left: MDS embedding of the Palmer Penguins dataset using PFGAP proximities. Right: the same embedding, with the points having the highest outlier scores in each class highlighted in red.
  • Figure 2: Critical difference plot of the RF, PF, and KNN models on 31 small, vector-valued datasets. The RF model usually ranks best, followed by the PF model with 100 trees. The KNN model and the PF model with 11 trees are statistically tied.