Distributed Genetic Algorithm for Feature Selection
Michael Potter, Ayberk Yarkın Yıldız, Nishanth Marer Prabhu, Cameron Gordon
TL;DR
This work tackles high-dimensional feature selection (FS) with a distributed genetic algorithm (GA) that encodes each feature subset as a binary chromosome and optimizes model performance on validation data. The population evolves iteratively through one-point crossover, mutation, and elitism, and fitness evaluation is parallelized with PySpark and JobLib so that many feature subsets are evaluated simultaneously. This yields $2\times$ to $25\times$ speedups and often improves metrics such as Accuracy, F1, and ROC-AUC. Experiments on three OpenML2013 agnostic datasets (Sylva, Gina, Hiva) with various models demonstrate both computational efficiency and improved predictive performance, though parallelism-induced nondeterminism makes results harder to reproduce. The findings support scalable feature selection for high-dimensional tasks, enabling more extensive search and faster deployment in data-heavy ML pipelines.
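The loop summarized above (binary chromosomes, one-point crossover, mutation, elitism, batched fitness evaluation) can be illustrated with a minimal sketch. This is not the authors' implementation: `toy_fitness` is a hypothetical stand-in for a validation score, truncation selection of parents is an assumption, and all function names and parameter defaults are chosen only for illustration. The `map_fn` argument marks the step where a parallel map could be substituted.

```python
import random

def one_point_crossover(a, b, rng):
    """Swap the tails of two parents at one random cut point."""
    cut = rng.randrange(1, len(a))
    return a[:cut] + b[cut:], b[:cut] + a[cut:]

def mutate(chrom, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chrom]

def toy_fitness(chrom):
    """Hypothetical stand-in for a validation score: reward the first five
    (assumed informative) features and lightly penalize the rest."""
    return sum(chrom[:5]) - 0.1 * sum(chrom[5:])

def run_ga(fitness, n_features, pop_size=20, generations=30,
           n_elite=2, mutation_rate=0.05, seed=0, map_fn=map):
    rng = random.Random(seed)
    # Each chromosome is a bit list: 1 keeps the feature, 0 drops it.
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best_per_gen = []
    for _ in range(generations):
        # Fitness evaluation is the step a parallel map would replace.
        scores = list(map_fn(fitness, pop))
        ranked = [c for _, c in sorted(zip(scores, pop),
                                       key=lambda t: t[0], reverse=True)]
        best_per_gen.append(max(scores))
        # Elitism: copy the top chromosomes unchanged.
        next_pop = [c[:] for c in ranked[:n_elite]]
        # Truncation selection (an assumption): breed from the top half.
        parents = ranked[:pop_size // 2]
        while len(next_pop) < pop_size:
            p1, p2 = rng.sample(parents, 2)
            for child in one_point_crossover(p1, p2, rng):
                if len(next_pop) < pop_size:
                    next_pop.append(mutate(child, mutation_rate, rng))
        pop = next_pop
    scores = list(map_fn(fitness, pop))
    best_score, best = max(zip(scores, pop), key=lambda t: t[0])
    return best, best_score, best_per_gen

if __name__ == "__main__":
    best, score, history = run_ga(toy_fitness, n_features=12)
    print(best, score)
```

Because each chromosome is scored independently, `map_fn` could in principle be replaced by the `map` method of a `concurrent.futures.ProcessPoolExecutor`, or by an equivalent Joblib or PySpark call, without touching the rest of the loop. With a fixed seed and a deterministic fitness function this serial version is repeatable, and elitism guarantees that the best score never decreases from one generation to the next.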
Abstract
We empirically show that process-based parallelism speeds up the Genetic Algorithm (GA) for Feature Selection (FS) by $2\times$ to $25\times$, while also improving Machine Learning (ML) model performance on metrics such as F1-score, Accuracy, and Receiver Operating Characteristic Area Under the Curve (ROC-AUC).
