Feature Importance Guided Random Forest Learning with Simulated Annealing Based Hyperparameter Tuning

Kowshik Balasubramanian, Andre Williams, Ismail Butun

TL;DR

The paper tackles the need for accurate and interpretable Random Forest (RF) classifiers in diverse domains by introducing Feature Importance Guided Random Forest (FIGRF), which combines probabilistic feature sampling with Simulated Annealing (SA) based hyperparameter tuning. It builds a composite feature importance measure from Permutation Importance, Gini Importance, and Mutual Information, normalizes and averages them across trees, and converts the result into a sampling distribution via a softmax function with temperature parameter $\alpha$. Each tree in the forest samples $m = \left\lfloor \sqrt{d} \right\rfloor$ features from this distribution, where $d$ is the total number of features, and trains on bootstrap data, enabling focused yet diverse learning. The hyperparameters $n\_estimators$ and $max\_depth$ are optimized with SA by maximizing a joint metric, the average of accuracy, precision, recall, and F1. Empirical results on seven public datasets show that FIGRF either matches or exceeds standard RF performance, with strong improvements on imbalanced data and feature usage tracking that supports interpretability. The approach offers a scalable, robust, and interpretable extension to ensemble learning, suitable for domains ranging from credit risk to IoT anomaly detection and biomedical data analysis.
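The sampling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the min-max normalization, the softmax form $p_i \propto \exp(s_i / \alpha)$, sampling without replacement, the example importance values, and all function names are assumptions made for illustration, since this summary does not specify them.

```python
import math
import random

def min_max_normalize(scores):
    """Rescale scores to [0, 1] (assumed normalization); constants map to 0."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

def composite_importance(permutation, gini, mutual_info):
    """Average the three normalized importance vectors feature by feature."""
    normalized = [min_max_normalize(v) for v in (permutation, gini, mutual_info)]
    return [sum(vals) / len(vals) for vals in zip(*normalized)]

def softmax_with_temperature(scores, alpha):
    """Assumed form: p_i proportional to exp(s_i / alpha), alpha > 0."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp((s - top) / alpha) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]

def sample_feature_subset(probs, rng):
    """Draw m = floor(sqrt(d)) distinct feature indices, weighted by probs."""
    d = len(probs)
    m = max(1, math.isqrt(d))
    remaining = list(range(d))
    weights = list(probs)
    chosen = []
    for _ in range(m):
        # Sampling without replacement is an assumption of this sketch.
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return chosen

# Made-up importance values for d = 9 features (illustration only).
perm = [0.30, 0.05, 0.10, 0.00, 0.20, 0.02, 0.01, 0.15, 0.08]
gini = [0.25, 0.04, 0.12, 0.01, 0.22, 0.03, 0.02, 0.18, 0.13]
mi = [0.40, 0.02, 0.09, 0.00, 0.21, 0.05, 0.01, 0.12, 0.10]

scores = composite_importance(perm, gini, mi)
probs = softmax_with_temperature(scores, alpha=0.5)
subset = sample_feature_subset(probs, random.Random(0))
print("sampling distribution:", [round(p, 3) for p in probs])
print("features for one tree:", subset)
```

Under this assumed form, a small $\alpha$ concentrates the draws on the highest-scoring features, while a large $\alpha$ approaches the uniform sampling of a standard Random Forest.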

Abstract

This paper introduces a novel framework for enhancing Random Forest classifiers by integrating probabilistic feature sampling and hyperparameter tuning via Simulated Annealing. The proposed framework delivers substantial gains in predictive accuracy and generalization, addressing the challenges of robust classification across diverse domains, including credit risk evaluation, anomaly detection in IoT ecosystems, early-stage medical diagnostics, and high-dimensional biological data analysis. To overcome the limitations of conventional Random Forests, we present an approach that places stronger emphasis on capturing the most relevant signals from data while enabling adaptive hyperparameter configuration. The model is guided toward features that contribute more meaningfully to classification, and this guidance is paired with dynamic parameter tuning. The results demonstrate consistent accuracy improvements and meaningful insights into feature relevance, showcasing the efficacy of combining importance-aware sampling and metaheuristic optimization.
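As a rough illustration of hyperparameter tuning via Simulated Annealing, the sketch below searches over `n_estimators` and `max_depth` to maximize the average of accuracy, precision, recall, and F1. The neighbor move, geometric cooling schedule, bounds, and the `toy_evaluate` stand-in (which replaces actually training and scoring a forest) are all hypothetical choices for this example, not details taken from the paper.

```python
import math
import random

def joint_metric(accuracy, precision, recall, f1):
    """Average of the four scores used as the tuning objective."""
    return (accuracy + precision + recall + f1) / 4.0

def neighbor(params, bounds, rng):
    """Hypothetical move: nudge one hyperparameter by a small integer step."""
    name = rng.choice(sorted(params))
    lo, hi = bounds[name]
    step = rng.choice([-2, -1, 1, 2])
    candidate = dict(params)
    candidate[name] = min(hi, max(lo, params[name] + step))
    return candidate

def simulated_annealing(evaluate, start, bounds, rng,
                        t0=1.0, cooling=0.9, steps=200):
    """Maximize evaluate(params) with an assumed geometric cooling schedule."""
    current, current_score = dict(start), evaluate(start)
    best, best_score = dict(current), current_score
    temperature = t0
    for _ in range(steps):
        candidate = neighbor(current, bounds, rng)
        score = evaluate(candidate)
        delta = score - current_score
        # Accept improvements; accept worse moves with probability exp(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = candidate, score
            if current_score > best_score:
                best, best_score = dict(current), current_score
        temperature = max(temperature * cooling, 1e-12)
    return best, best_score

def toy_evaluate(params):
    """Stand-in for training and scoring a forest; peaks at (120, 12)."""
    score = (1.0
             - abs(params["n_estimators"] - 120) / 400
             - abs(params["max_depth"] - 12) / 100)
    return joint_metric(score, score, score, score)

bounds = {"n_estimators": (10, 300), "max_depth": (2, 30)}
start = {"n_estimators": 50, "max_depth": 5}
best, best_score = simulated_annealing(toy_evaluate, start, bounds, random.Random(0))
print("best hyperparameters:", best, "joint metric:", round(best_score, 4))
```

In a real run, `toy_evaluate` would be replaced by a function that fits the forest with the candidate hyperparameters and returns the joint metric on held-out data.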

Paper Structure

This paper contains 13 sections, 10 equations, 2 figures, 3 tables.

Figures (2)

  • Figure 1: Top ten important features and Standard vs FIGRF performance - IOTID20 Dataset
  • Figure 2: Top ten important features and Standard vs FIGRF performance - UCI Darwin Dataset