Centroid Decision Forest

Amjad Ali, Saeed Aldahmani, Hailiang Du, Zardad Khan

TL;DR

The Centroid Decision Forest (CDF) addresses the challenge of high-dimensional classification by replacing threshold-based splits with centroid-based partitions guided by a class separability score (CSS) for discriminative feature selection. An ensemble of centroid decision trees (CDTs) is built using bootstrapping and random feature subsetting, with predictions aggregated by majority voting to improve robustness and scalability. Across 23 high-dimensional datasets, CDF delivers superior accuracy and Cohen’s kappa in many cases, highlighting its effectiveness for complex pattern recognition tasks while maintaining interpretability. The work emphasizes practical applicability, discusses limitations related to distance measures for non-numeric data, and points to future directions in flexible distances, hybrid feature selection, and scalable implementations.

Abstract

This paper introduces the centroid decision forest (CDF), a novel ensemble learning framework that redefines the splitting strategy and tree building of ordinary decision trees for high-dimensional classification. The splitting approach in CDF differs from that of traditional decision trees in that the class separability score (CSS) determines the selection of the most discriminative features at each node to construct centroids of the partitions (daughter nodes). The splitting criterion uses Euclidean distance measurements from each class centroid to achieve a splitting mechanism that is more flexible and robust. Centroids are constructed by computing the mean feature values of the selected features for each class, ensuring a class-representative division of the feature space. This centroid-driven approach enables CDF to capture complex class structures while maintaining interpretability and scalability. To evaluate CDF, 23 high-dimensional datasets are used to assess its performance against different state-of-the-art classifiers through classification accuracy and Cohen's kappa statistic. The experimental results show that CDF outperforms the conventional methods, establishing its effectiveness and flexibility for high-dimensional classification problems.
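The centroid-based split described in the abstract can be illustrated with a minimal sketch of a single node. The abstract does not give the CSS formula, so the function `separability_score` below is a hypothetical stand-in (a between-class to within-class variance ratio per feature), not the paper's CSS. The function names, the `n_features` parameter, and the synthetic data are likewise assumptions made for illustration only.

```python
import numpy as np

def separability_score(X, y):
    """Stand-in score per feature (ASSUMPTION: not the paper's CSS).

    Uses the ratio of between-class to within-class sum of squares.
    """
    overall_mean = X.mean(axis=0)
    between = np.zeros(X.shape[1])
    within = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        class_mean = Xc.mean(axis=0)
        between += len(Xc) * (class_mean - overall_mean) ** 2
        within += ((Xc - class_mean) ** 2).sum(axis=0)
    return between / (within + 1e-12)  # small constant avoids division by zero

def centroid_split(X, y, n_features=2):
    """Split one node: pick top-scoring features, build class centroids,
    and send each sample to the partition of its nearest centroid."""
    scores = separability_score(X, y)
    selected = np.argsort(scores)[::-1][:n_features]  # highest scores first
    classes = np.unique(y)
    # One centroid per class: mean of the selected features within that class.
    centroids = np.stack([X[y == c][:, selected].mean(axis=0) for c in classes])
    # Euclidean distance from every sample to every class centroid.
    diffs = X[:, selected][:, None, :] - centroids[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    assignment = dists.argmin(axis=1)  # index of the nearest centroid
    return selected, centroids, assignment

# Synthetic two-class data: 50 features, only features 3 and 7 are informative.
rng = np.random.default_rng(0)
n = 40
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, 50))
X[y == 1, 3] += 6.0
X[y == 1, 7] += 6.0

selected, centroids, assignment = centroid_split(X, y, n_features=2)
print("selected features:", sorted(selected.tolist()))
print("partition sizes:", np.bincount(assignment))
```

In a full tree, each resulting partition would be split again in the same way until a stopping rule is met. A forest would then repeat this on bootstrap samples with random feature subsets and combine the trees by majority vote, as the TL;DR describes.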

Paper Structure

This paper contains 11 sections, 4 theorems, 25 equations, 6 figures, 3 tables, 4 algorithms.

Key Result

Lemma 1

(CSS and Feature Discriminability) If $\text{CSS}_j > \text{CSS}_k$ for two features $j$ and $k$, then feature $j$ is more discriminative than feature $k$ for classification.

Figures (6)

  • Figure 1: Boxplot of classification accuracy for the proposed CDF and state-of-the-art methods across multiple datasets. Accuracy is averaged over 500 repeated training-testing splits.
  • Figure 2: Boxplot of Cohen’s kappa for the proposed CDF and state-of-the-art methods across multiple datasets. Kappa values are averaged over 500 repeated training-testing splits.
  • Figure 3: Structure of the CDT, illustrating selected features, centroids and splitting. At the first node, it selects top features (i.e., $X_{377}$, $X_{493}$) via CSS and splits data using class centroids (i.e., $(0.96, 0.97, 0.91)$ vs. $(-0.55, -0.57, -0.49)$). At each subsequent node, centroid-based partitioning refines separation, with Wilcoxon tests ($p < 0.001$) confirming feature significance. Final classification uses majority voting in the leaf nodes.
  • Figure 4: Classification accuracy in the CDF improves as more trees are added, stabilizing beyond 300 due to diminishing returns.
  • Figure 5: Impact of the percentage of randomly selected features per node on classification accuracy in the CDF.
  • ...and 1 more figure

Theorems & Definitions (10)

  • Definition 1
  • Lemma 1
  • proof
  • Theorem 1
  • proof
  • Definition 2
  • Theorem 2
  • proof
  • Lemma 2
  • proof