Distributed Genetic Algorithm for Feature Selection

Michael Potter, Ayberk Yarkın Yıldız, Nishanth Marer Prabhu, Cameron Gordon

TL;DR

This work tackles high-dimensional feature selection by employing a distributed genetic algorithm that encodes feature subsets as binary chromosomes and optimizes model performance on validation data. Through one-point crossover, mutation, elitism, and iterative evolution, the method is parallelized using PySpark and JobLib to evaluate many feature subsets simultaneously, achieving $2\times$ to $25\times$ speedups and often improving metrics such as Accuracy, F1, and ROC-AUC. Experiments on three OpenML2013 agnostic datasets (Sylva, Gina, Hiva) with various models demonstrate both computational efficiency and enhanced predictive performance, though reproducibility challenges arise due to parallelism-induced nondeterminism. The findings support scalable FS for high-dimensional tasks, enabling more extensive search and faster deployment in data-heavy ML pipelines.
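The loop summarized above (binary chromosomes, one-point crossover, mutation, elitism, and batch evaluation of candidate subsets) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_fitness` is a hypothetical stand-in for a model's validation score, the parent-selection rule (top half of the population) and all parameter defaults are assumptions, and the `map_fn` argument is an assumed hook standing in for the PySpark or JobLib evaluation step.

```python
import random
from typing import Callable, List, Tuple

Chromosome = Tuple[int, ...]  # 1 = feature kept, 0 = feature dropped


def toy_fitness(chromosome: Chromosome) -> float:
    """Hypothetical stand-in for a validation score (not from the paper).

    Pretends the first three features are informative and that every
    other selected feature costs a small penalty.
    """
    informative = sum(chromosome[:3])
    noise = sum(chromosome[3:])
    return informative - 0.1 * noise


def one_point_crossover(a: Chromosome, b: Chromosome, rng: random.Random) -> Chromosome:
    """Take the head of one parent and the tail of the other."""
    point = rng.randrange(1, len(a))  # cut strictly inside the chromosome
    return a[:point] + b[point:]


def mutate(chromosome: Chromosome, rate: float, rng: random.Random) -> Chromosome:
    """Flip each bit independently with probability `rate`."""
    return tuple(1 - g if rng.random() < rate else g for g in chromosome)


def evolve(
    fitness: Callable[[Chromosome], float],
    n_features: int,
    pop_size: int = 20,
    generations: int = 30,
    n_elite: int = 2,
    mutation_rate: float = 0.05,
    seed: int = 0,
    map_fn: Callable = map,  # assumed hook: swap in a parallel map here
) -> Tuple[Chromosome, float]:
    rng = random.Random(seed)
    population: List[Chromosome] = [
        tuple(rng.randint(0, 1) for _ in range(n_features))
        for _ in range(pop_size)
    ]
    for _ in range(generations):
        # Fitness evaluation is the expensive, independent step,
        # so it is the part worth distributing.
        scores = list(map_fn(fitness, population))
        ranked = [
            c for _, c in sorted(
                zip(scores, population), key=lambda p: p[0], reverse=True
            )
        ]
        elite = ranked[:n_elite]          # elitism: best survive unchanged
        parents = ranked[: pop_size // 2]  # assumed selection rule
        children: List[Chromosome] = []
        while len(elite) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = one_point_crossover(a, b, rng)
            children.append(mutate(child, mutation_rate, rng))
        population = elite + children
    scores = list(map_fn(fitness, population))
    best_score, best = max(zip(scores, population), key=lambda p: p[0])
    return best, best_score
```

Calling `evolve(toy_fitness, n_features=8)` returns the best chromosome found and its score. Because only the evaluation step goes through `map_fn`, a pool-backed map such as `concurrent.futures.ProcessPoolExecutor().map` could be passed in without changing the rest of the loop, provided the fitness function is picklable.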

Abstract

We empirically show that process-based parallelism speeds up the Genetic Algorithm (GA) for Feature Selection (FS) by 2x to 25x, while also improving Machine Learning (ML) model performance on metrics such as F1-score, Accuracy, and Receiver Operating Characteristic Area Under the Curve (ROC-AUC).

Paper Structure

This paper contains 25 sections, 3 equations, 7 figures, and 6 tables.

Figures (7)

  • Figure 1: Example of chromosome being used for feature selection on the Iris Dataset bezdek1999will
  • Figure 2: General Genetic Algorithm Flow Diagram Liao_Sun_2001
  • Figure 3: Advantage in Time Consumption because of Parallelism. Figure only applies to one Evolution Round
  • Figure 4: Genetic Algorithm Code Flow
  • Figure 5: Genetic Algorithm Validation Score per Iteration
  • ...and 2 more figures