Random Forest Calibration

Mohammad Hossein Shaker, Eyke Hüllermeier

TL;DR

The paper investigates how well Random Forest probability estimates can be calibrated and whether post-hoc calibration methods outperform or complement RF itself. It defines and contrasts class-wise, probability-wise, and multiclass calibration, and surveys a broad suite of calibration techniques applicable to RF outputs. Through extensive synthetic and real-data experiments, it shows that calibration performance depends on the chosen metric, with hyper-parameter tuning (notably tree depth) and ensemble size often achieving parity with or superiority over standard calibrators. The findings imply that a well-optimized RF can provide competitive, sometimes superior, calibrated probabilities without heavy reliance on external calibration models, informing practical deployment of RF in safety-critical domains.

Abstract

The Random Forest (RF) classifier is often claimed to be relatively well calibrated when compared with other machine learning methods. Moreover, the existing literature suggests that traditional calibration methods, such as isotonic regression, do not substantially enhance the calibration of RF probability estimates unless supplied with extensive calibration data sets, which can represent a significant obstacle in cases of limited data availability. Nevertheless, there seems to be no comprehensive study validating such claims and systematically comparing state-of-the-art calibration methods specifically for RF. To close this gap, we investigate a broad spectrum of calibration methods tailored to or at least applicable to RF, ranging from scaling techniques to more advanced algorithms. Our results based on synthetic as well as real-world data unravel the intricacies of RF probability estimates, scrutinize the impacts of hyper-parameters, and compare calibration methods in a systematic way. We show that a well-optimized RF performs as well as or better than leading calibration approaches.
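For readers unfamiliar with isotonic regression, the traditional calibrator named in the abstract, the following is a minimal sketch of the pool-adjacent-violators idea in plain Python. It is not the authors' implementation: the function names `fit_isotonic` and `predict_isotonic`, the step-function prediction rule, and the toy scores are all assumptions made for illustration. The scores stand in for RF positive-class estimates on a held-out calibration set.

```python
from bisect import bisect_left
from collections import defaultdict


def fit_isotonic(scores, labels):
    """Fit a non-decreasing step map from score to probability (PAV sketch)."""
    # Aggregate equal scores first so that ties share one block.
    agg = defaultdict(lambda: [0.0, 0])
    for s, y in zip(scores, labels):
        agg[s][0] += y
        agg[s][1] += 1

    # Each block stores [sum of labels, count, largest score in the block].
    blocks = []
    for s in sorted(agg):
        total, count = agg[s]
        blocks.append([total, count, s])
        # Pool adjacent blocks while the left mean exceeds the right mean.
        while (len(blocks) > 1
               and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            t, c, upper = blocks.pop()
            blocks[-1][0] += t
            blocks[-1][1] += c
            blocks[-1][2] = upper

    thresholds = [b[2] for b in blocks]
    values = [b[0] / b[1] for b in blocks]
    return thresholds, values


def predict_isotonic(model, score):
    """Return the calibrated value of the first block covering the score."""
    thresholds, values = model
    idx = bisect_left(thresholds, score)
    return values[min(idx, len(values) - 1)]


# Hypothetical calibration set: raw scores and binary labels.
raw_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]
true_labels = [0, 0, 1, 0, 1, 1]
model = fit_isotonic(raw_scores, true_labels)
calibrated = [predict_isotonic(model, s) for s in raw_scores]
```

In this toy set the scores 0.3 and 0.4 violate monotonicity (a positive followed by a negative), so they are pooled into one block. Because every block mean is estimated from the calibration set alone, small calibration sets yield coarse, noisy step functions, which is consistent with the abstract's remark about needing extensive calibration data.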

Paper Structure

This paper contains 28 sections, 19 equations, 18 figures, 21 tables, 2 algorithms.

Figures (18)

  • Figure 1: This decision tree estimates the probability of the positive class as $1/4$ for all instances falling in either of the two leaf nodes shaded in grey. Class-wise calibration then requires that $\mathbb{P}(Y = +1 \, | \, \boldsymbol{x} \in A \cup B) = 1/4$, where $A, B \subset \mathcal{X}$ are the regions in the instance space associated with the two nodes, respectively. Note that this neither implies instance-wise calibration nor "leaf-wise" calibration (i.e., $\mathbb{P}(Y = +1 \, | \, \boldsymbol{x} \in A) = 1/4$ and $\mathbb{P}(Y = +1 \, | \, \boldsymbol{x} \in B) = 1/4$).
  • Figure 2: Illustration of loss decomposition for binary classification and a finite one-dimensional instance space $\mathcal{X}$ (the black points). The numbers at the bottom indicate the true probabilities of the positive class. The blue arrow indicates a split of the data into two groups, leading to probability estimates $\boldsymbol{p}(x) = (0.8, 0.2)$ for the left and $\boldsymbol{p}(x) = (0.2, 0.8)$ for the right group, based on the given training data (positive and negative class indicated by red crosses and black circles, respectively).
  • Figure 3: The effect of setting max-depth of trees in an RF on calibration performance on synthetic data. The reliability diagram represents RF with low (left), optimal (middle), and too high (right) value for parameter max-depth.
  • Figure 4: The impact of varying overlap between two Gaussian distributions on the performance of calibration methods analyzed using synthetic data across increasing dimensions. The results are organized into columns corresponding to each dimensionality—2, 5, 10, and 20—from left to right. The performance metrics—Brier score, TCE, and ECE—are displayed in rows from top to bottom.
  • Figure 5: Critical difference diagram of 30 real datasets on Brier score (left) and ECE (right).
  • ...and 13 more figures
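The point made in the caption of Figure 1, and the metric dependence visible in the captions of Figures 4 and 5, can be illustrated with a small numeric sketch. The leaf-level rates used here (0.1 in region A, 0.4 in region B, ten instances each) are hypothetical numbers chosen so that the pooled rate is 1/4; they are not taken from the paper. The binned ECE below is one common definition and may differ in detail from the one the authors use.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(labels)


def expected_calibration_error(probs, labels, n_bins=10):
    """Binned ECE: weighted gap between mean prediction and observed rate."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))

    n = len(labels)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += len(members) / n * abs(mean_p - freq)
    return ece


# Hypothetical leaves: A has 1 positive in 10, B has 4 positives in 10.
labels_a = [1] + [0] * 9
labels_b = [1] * 4 + [0] * 6
labels = labels_a + labels_b

# The tree of Figure 1 predicts 1/4 in both leaves.
pooled_probs = [0.25] * 20
# A leaf-wise calibrated alternative predicts each leaf's own rate.
leafwise_probs = [0.1] * 10 + [0.4] * 10

ece_pooled = expected_calibration_error(pooled_probs, labels)
brier_pooled = brier_score(pooled_probs, labels)
brier_leafwise = brier_score(leafwise_probs, labels)
```

Under these assumed numbers the pooled prediction has zero binned ECE, since 5 of the 20 instances predicted at 0.25 are positive, yet its Brier score is higher than that of the leaf-wise prediction. The gap reflects the part of the loss that a calibration metric based only on the predicted value cannot see, which is one reason a ranking of methods can change with the metric.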