Distributed Genetic Algorithm for Feature Selection

Michael Potter; Ayberk Yarkın Yıldız; Nishanth Marer Prabhu; Cameron Gordon

TL;DR

This work tackles high-dimensional feature selection (FS) with a distributed genetic algorithm (GA) that encodes feature subsets as binary chromosomes and optimizes model performance on validation data. The GA evolves its population through one-point crossover, mutation, and elitism over successive evolution rounds, and is parallelized with PySpark and JobLib so that many feature subsets are evaluated simultaneously, achieving $2\times$ to $25\times$ speedups and often improving metrics such as Accuracy, F1, and ROC-AUC. Experiments on three OpenML2013 agnostic datasets (Sylva, Gina, Hiva) with various models demonstrate both computational efficiency and improved predictive performance, though parallelism-induced nondeterminism makes results harder to reproduce. The findings support scalable FS for high-dimensional tasks, enabling more extensive search and faster deployment in data-heavy ML pipelines.
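The operators named above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the truncation-style parent selection, and the per-bit mutation rate are assumptions made for illustration, and chromosomes are assumed to be lists of 0/1 with at least two genes.

```python
import random


def one_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length chromosomes at a random cut point."""
    cut = rng.randrange(1, len(parent_a))  # cut lies strictly inside the chromosome
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]


def mutate(chromosome, rate, rng):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in chromosome]


def next_generation(population, scores, n_elite, rate, rng):
    """Keep the best `n_elite` chromosomes unchanged, fill the rest with offspring."""
    ranked = [c for _, c in sorted(zip(scores, population),
                                   key=lambda pair: pair[0], reverse=True)]
    new_population = [list(c) for c in ranked[:n_elite]]  # elitism
    # Assumption: parents are drawn from the top half (truncation selection).
    parents = ranked[:max(2, len(ranked) // 2)]
    while len(new_population) < len(population):
        parent_a, parent_b = rng.sample(parents, 2)
        for child in one_point_crossover(parent_a, parent_b, rng):
            if len(new_population) < len(population):
                new_population.append(mutate(child, rate, rng))
    return new_population
```

Calling `next_generation` once per evolution round, with `scores` taken from validation performance, gives the iterative loop described above. Passing a seeded `random.Random` keeps the serial version repeatable.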

Abstract

We empirically show that process-based parallelism speeds up the Genetic Algorithm (GA) for Feature Selection (FS) by $2\times$ to $25\times$, while also improving Machine Learning (ML) model performance on metrics such as F1-score, Accuracy, and Receiver Operating Characteristic Area Under the Curve (ROC-AUC).
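The speedup is plausible because each chromosome's fitness can be computed independently, so scoring a population reduces to a map over chromosomes. The sketch below shows that structure under stated assumptions: `select_columns`, `evaluate_population`, and `toy_fitness` are hypothetical names, and `toy_fitness` is a stand-in for training a model on the selected features and returning its validation score. The mapping backend is a parameter and defaults to the serial built-in `map`.

```python
def select_columns(rows, chromosome):
    """Keep only the columns whose chromosome bit is 1."""
    keep = [i for i, bit in enumerate(chromosome) if bit]
    return [[row[i] for i in keep] for row in rows]


def evaluate_population(population, fitness, map_fn=map):
    """Score every chromosome; `map_fn` decides serial vs. parallel execution."""
    return list(map_fn(fitness, population))


def toy_fitness(chromosome):
    # Hypothetical stand-in for a validation score: rewards features 0 and 2
    # and applies a small penalty per selected feature.
    return chromosome[0] + chromosome[2] - 0.1 * sum(chromosome)


population = [[1, 0, 1, 0], [1, 1, 1, 1], [0, 0, 0, 1]]
scores = evaluate_population(population, toy_fitness)
```

To run the same code across processes, the `map` method of a `concurrent.futures.ProcessPoolExecutor` can be passed as `map_fn`, provided `fitness` is a picklable top-level function. This stands in for the PySpark and JobLib backends the paper uses, which are not shown here.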

Paper Structure (25 sections, 3 equations, 7 figures, 6 tables)

Introduction
Methodology
Genetic Algorithm for Feature Selection
One-Point Cross-Over
Mutation
Elitism
Evolution Rounds
Parallelism of the Genetic Algorithm For Feature Selection
Our Implementation
Experiments
Configuration Details
Datasets
Sylva Agnostic
Gina Agnostic
Hiva Agnostic
...and 10 more sections

Figures (7)

Figure 1: Example of a chromosome being used for feature selection on the Iris Dataset [bezdek1999will]
Figure 2: General Genetic Algorithm Flow Diagram [Liao_Sun_2001]
Figure 3: Advantage in Time Consumption because of Parallelism. The figure applies to only one Evolution Round
Figure 4: Genetic Algorithm Code Flow
Figure 5: Genetic Algorithm Validation Score per Iteration
...and 2 more figures
