Centroid Decision Forest
Amjad Ali, Saeed Aldahmani, Hailiang Du, Zardad Khan
TL;DR
The Centroid Decision Forest (CDF) addresses the challenge of high-dimensional classification by replacing threshold-based splits with centroid-based partitions guided by a class separability score (CSS) for discriminative feature selection. An ensemble of centroid decision trees (CDTs) is built using bootstrapping and random feature subsetting, with predictions aggregated by majority voting to improve robustness and scalability. Across 23 high-dimensional datasets, CDF delivers superior accuracy and Cohen’s kappa in many cases, highlighting its effectiveness for complex pattern recognition tasks while maintaining interpretability. The work emphasizes practical applicability, discusses limitations related to distance measures for non-numeric data, and points to future directions in flexible distances, hybrid feature selection, and scalable implementations.
Abstract
This paper introduces the centroid decision forest (CDF), a novel ensemble learning framework that redefines the splitting strategy and tree building of ordinary decision trees for high-dimensional classification. The splitting approach in CDF differs from that of traditional decision trees in that the class separability score (CSS) determines the selection of the most discriminative features at each node to construct centroids of the partitions (daughter nodes). The splitting criterion uses Euclidean distances from each class centroid to achieve a splitting mechanism that is more flexible and robust. Centroids are constructed by computing the mean values of the selected features for each class, ensuring a class-representative division of the feature space. This centroid-driven approach enables CDF to capture complex class structures while maintaining interpretability and scalability. To evaluate CDF, 23 high-dimensional datasets are used to assess its performance against different state-of-the-art classifiers through classification accuracy and Cohen's kappa statistic. The experimental results show that CDF outperforms the conventional methods, establishing its effectiveness and flexibility for high-dimensional classification problems.
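The centroid-based split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the discriminative features have already been chosen (the CSS formula is not given in the abstract, so feature selection is left out), and the function names `class_centroids`, `nearest_centroid`, and `centroid_split` are hypothetical. Each class centroid is the mean of the selected features over that class, and each observation is sent to the partition of its nearest centroid in Euclidean distance.

```python
import math
from collections import defaultdict


def class_centroids(X, y, features):
    """Mean of the selected features for each class (hypothetical helper)."""
    sums = {}
    counts = defaultdict(int)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, j in enumerate(features):
            acc[k] += row[j]
        counts[label] += 1
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def nearest_centroid(row, centroids, features):
    """Label of the class centroid closest to `row` in Euclidean distance."""
    point = [row[j] for j in features]
    return min(centroids, key=lambda c: math.dist(point, centroids[c]))


def centroid_split(X, y, features):
    """Partition row indices by nearest class centroid.

    `features` is assumed to be the list of column indices already
    selected as most discriminative; how CSS ranks them is not shown here.
    """
    centroids = class_centroids(X, y, features)
    partitions = defaultdict(list)
    for i, row in enumerate(X):
        partitions[nearest_centroid(row, centroids, features)].append(i)
    return centroids, dict(partitions)


# Toy data: columns 0 and 1 separate the classes, column 2 does not.
X = [[0.0, 0.0, 5.0], [1.0, 0.0, 9.0], [10.0, 10.0, 5.0], [11.0, 10.0, 9.0]]
y = ["a", "a", "b", "b"]
centroids, partitions = centroid_split(X, y, features=[0, 1])
print(centroids)
print(partitions)
```

In a full tree, each partition would presumably be split again in the same way until a stopping rule is met, and a forest would repeat this over bootstrap samples and random feature subsets before majority voting. Those steps are omitted from this sketch.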
