Infinite random forests for imbalanced classification tasks

Moria Mayala; Olivier Wintenberger; Charles Tillier; Clément Dombry

TL;DR

This work proposes a debiasing procedure for under-sampling Infinite Random Forests based on Importance Sampling (IS) using odds ratios. With 1-NN base learners, it proves that the approach is nearly minimax optimal for Lipschitz continuous objectives and shows that the IS bagged 1-NN estimator matches the convergence rate of its subsampled counterpart while attaining lower asymptotic variance in most cases.

Abstract

We study predictive probability inference in classification tasks using random forests under class imbalance. We focus on two simplified variants of Breiman's algorithm, namely subsampling Infinite Random Forests (IRFs) and under-sampling IRFs, and establish their asymptotic normality. In the under-sampling setting, training data from both classes are resampled to achieve balance, which enhances minority class representation but introduces a biased model. To correct this, we propose a debiasing procedure based on Importance Sampling (IS) using odds ratios. We instantiate our results using 1-Nearest Neighbor (1-NN) classifiers as base learners in the IRFs and prove the nearly minimax optimality of the approach for Lipschitz continuous objectives. We also show that the IS bagged 1-NN estimator matches the convergence rate of its subsampled counterpart while attaining lower asymptotic variance in most cases. Our theoretical findings are supported by simulation studies, highlighting the empirical benefits of the proposed approach.
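The odds-ratio correction mentioned in the abstract can be illustrated with a minimal sketch. It assumes the standard prior-shift setting: resampling changes the minority-class prior from `p_orig` to `p_biased` but leaves the class-conditional distributions unchanged, so the posterior odds are rescaled by the ratio of prior odds. The function name and arguments are hypothetical, and this is not claimed to be the paper's exact estimator.

```python
def debias_probability(eta_biased: float, p_orig: float, p_biased: float) -> float:
    """Map a posterior estimated under resampled priors back to the original priors.

    eta_biased: estimate of P*(Y=1 | x) under the resampled (biased) model.
    p_orig:     P(Y=1) in the original data, strictly between 0 and 1.
    p_biased:   P*(Y=1) after resampling, strictly between 0 and 1.
    Assumption: only the class priors differ between the two models.
    """
    # Ratio of the original prior odds to the resampled prior odds.
    ratio = (p_orig / (1.0 - p_orig)) * ((1.0 - p_biased) / p_biased)
    # Rescale the biased posterior odds, then convert odds back to a probability.
    num = ratio * eta_biased
    return num / (num + (1.0 - eta_biased))
```

For instance, if the minority class has prior 0.1 and the classes are balanced to 0.5, a biased estimate of 0.5 maps back to 0.1, which is the original prior, as expected when the features carry no information at that point.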

Paper Structure (58 sections, 27 theorems, 213 equations, 2 figures, 2 algorithms)

Introduction
The problem and the models
Problem statement
The models
The original model
The biased classification model
The odds ratios and the Importance Sampling (IS) procedure
Framework and hypotheses
Random Forests (RF)
Infinite Random Forests (IRFs)
Subsampling IRFs
Under-sampling IRFs
Asymptotic properties of IRFs
CLT for subsampling and under-sampling IRFs
The subsampling case
...and 43 more sections

Key Result

Proposition 3.1

The subsampling IRF estimator at point $\bm{x}\in \mathcal{X}$ may be written as [equation not shown], where $T^s(\bm{x}; \mathbb{U};\bm Z_S)$ are the individual trees evaluated at point $\bm{x}\in \mathcal{X}$, as defined in eq:T_onesample. (buhlmann2002analyzing called the subsampling IRF subbagging, a nickname for subsample aggregating, where subsampling is used instead of bootstrap resampling.)
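As a rough illustration of subbagging with 1-NN base learners, the sketch below averages 1-NN predictions over random subsamples of size `s` drawn without replacement. It is a finite Monte Carlo approximation under stated assumptions: the infinite forest corresponds to the limit of infinitely many trees, and the names `one_nn_label`, `subbagged_one_nn`, and `n_trees` are hypothetical rather than taken from the paper.

```python
import random

def one_nn_label(x, points, labels):
    """Return the label of the training point closest to x (squared Euclidean distance)."""
    best = min(
        range(len(points)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(points[i], x)),
    )
    return labels[best]

def subbagged_one_nn(x, points, labels, s, n_trees=500, seed=0):
    """Average 1-NN predictions at x over n_trees subsamples of size s."""
    rng = random.Random(seed)
    n = len(points)
    total = 0
    for _ in range(n_trees):
        # Subsample without replacement, in contrast with the bootstrap.
        idx = rng.sample(range(n), s)
        total += one_nn_label(x, [points[i] for i in idx], [labels[i] for i in idx])
    # The mean of 0/1 labels is an estimate of P(Y=1 | x).
    return total / n_trees
```

With `s` equal to the full sample size, every subsample is the whole training set, so the estimate reduces to the plain 1-NN label; smaller `s` averages over more distinct neighbours and gives a smoother probability estimate.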

Figures (2)

Figure 1: Comparison of the MISE trend with that of the bias-variance decomposition with respect to the logarithm of the sample size for Setup 1.
Figure 2: Comparison of the MISE trend with that of the bias-variance decomposition with respect to the logarithm of the sample size for Setup 2.

Theorems & Definitions (70)

Definition 2.1
Proposition 3.1: IRF: the subsampling case
proof
Proposition 3.2: IRF: the under-sampling case
proof
Theorem 4.1
proof
Corollary 4.2
proof
Theorem 4.3
...and 60 more
