Pulling the Carpet Below the Learner's Feet: Genetic Algorithm To Learn Ensemble Machine Learning Model During Concept Drift

Teddy Lazebnik

TL;DR

This paper addresses concept drift in streaming data and proposes a genetic-algorithm-driven ensemble to maintain performance under evolving distributions. The authors introduce a two-level architecture in which a global ML model with a CD detector aggregates a population of self-adapting pipelines, plus a divide-and-conquer variant that partitions the data and leverages AutoML for subproblem optimization. The main contributions are the GA-based population design for CD handling, the integration of per-pipeline detectors with a global predictor, and extensive synthetic-data experiments showing robustness to unknown CD characteristics and drift patterns; the improved version is also shown to be compatible with several AutoML libraries. The results indicate greater resilience and more consistent performance than single-pipeline baselines, highlighting the practical potential of GA-driven ensemble methods for adaptive learning in non-stationary environments.

Abstract

Data-driven models in general, and machine learning (ML) models in particular, have gained popularity in recent years, with increasing use across scientific and engineering domains. When using ML models in realistic, dynamic environments, users often need to handle the challenge of concept drift (CD). In this study, we explore the application of genetic algorithms (GAs) to address the challenges posed by CD in such settings. We propose a novel two-level ensemble ML model that combines a global ML model with a CD detector, operating as an aggregator for a population of ML pipeline models, each with its own adjusted CD detector responsible for re-training its ML model. In addition, we show that the proposed model can be further improved by utilizing off-the-shelf automatic ML methods. Through extensive analysis on synthetic datasets, we show that the proposed model outperforms a single ML pipeline with a CD algorithm, particularly in scenarios with unknown CD characteristics. Overall, this study highlights the potential of ensemble ML and CD models obtained through a heuristic and adaptive optimization process, such as a GA, to handle complex CD events.
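
To make the two-level structure described in the abstract more concrete, the following is a minimal Python sketch and not the authors' implementation. It makes several assumptions that do not come from the paper: each pipeline is a least-squares linear model, the "CD detector" is a simple threshold on the mean squared error of the newest batch, the global model is a least-squares weighting of the members' predictions, and the population is fixed by hand rather than found by a GA. The names `Pipeline`, `TwoLevelEnsemble`, `window`, and `threshold` are hypothetical.

```python
import numpy as np


class Pipeline:
    """One population member: a linear model plus its own drift check (assumed)."""

    def __init__(self, window, threshold):
        self.window = window        # how many recent samples to re-train on
        self.threshold = threshold  # batch MSE above this counts as drift
        self.coef = None

    @staticmethod
    def _design(X):
        # Append a bias column so the model has an intercept.
        return np.c_[X, np.ones(len(X))]

    def fit(self, X, y):
        self.coef, *_ = np.linalg.lstsq(self._design(X), y, rcond=None)

    def predict(self, X):
        return self._design(X) @ self.coef

    def update(self, X_hist, y_hist, X_new, y_new):
        """Re-train on the latest window if the new batch looks drifted."""
        drifted = np.mean((self.predict(X_new) - y_new) ** 2) > self.threshold
        if drifted:
            self.fit(X_hist[-self.window:], y_hist[-self.window:])
        return bool(drifted)


class TwoLevelEnsemble:
    """Global aggregator with its own drift check over a pipeline population."""

    def __init__(self, pipelines, threshold):
        self.pipelines = pipelines
        self.threshold = threshold
        self.weights = None

    def _member_predictions(self, X):
        return np.column_stack([p.predict(X) for p in self.pipelines])

    def _fit_weights(self, X, y):
        # Least-squares combination of the members' predictions.
        P = self._member_predictions(X)
        self.weights, *_ = np.linalg.lstsq(P, y, rcond=None)

    def fit(self, X, y):
        for p in self.pipelines:
            p.fit(X, y)
        self._fit_weights(X, y)

    def predict(self, X):
        return self._member_predictions(X) @ self.weights

    def update(self, X_hist, y_hist, X_new, y_new):
        """Let each member react to drift, then re-weight globally if needed."""
        for p in self.pipelines:
            p.update(X_hist, y_hist, X_new, y_new)
        drifted = np.mean((self.predict(X_new) - y_new) ** 2) > self.threshold
        if drifted:
            self._fit_weights(X_new, y_new)
        return bool(drifted)
```

In this sketch, members with different `window` values react differently to the same drift: a short window forgets the old concept quickly, while a long window blends old and new data. The global re-weighting then favors whichever members fit the newest batch. In the paper, the composition of the population is what the GA searches over; that search is deliberately omitted here.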

Paper Structure

This paper contains 25 sections, 4 equations, 4 figures, 6 tables, 1 algorithm.

Figures (4)

  • Figure 1: A schematic view of shift and moving CD. The shift CD moves abruptly from one two-dimensional distribution ($x,y$) to another, whereas the moving CD changes gradually from the same source distribution to the same target distribution.
  • Figure 2: A schematic view of the learning problem under different CD types and a possible remedy using an ensemble ML model obtained through an initial search process. The distributions over time are shown as the mean $\pm$ standard deviation of some random variable, plotted on the y-axis. The x-axis indicates steps in time. In this example, a shift CD occurs between the third and fourth time steps, and a moving CD occurs between the sixth and tenth time steps. A possible ensemble model for this case would detect the shift and moving CDs and use three models: one that captures the original data between the CDs, a second that takes into account the recent, seemingly stable data together with some "tail" of the moving CD, and a third that is based only on the most recent period without CD.
  • Figure 3: A schematic view of Algorithm 1.
  • Figure 4: A schematic view of a dataset generation process for the experiments.
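
The difference between the shift and moving CD types in Figures 1 and 2 can be illustrated with a short Python sketch. This is an illustrative assumption and not the paper's dataset generator: the variable is taken to be Gaussian with a fixed standard deviation, a shift CD is modeled as an abrupt jump of the mean at one time step, and a moving CD as a linear interpolation of the mean between two time steps. The function names and parameters (`shift_drift`, `moving_drift`, `change_at`, `start`, `end`) are hypothetical.

```python
import numpy as np


def shift_drift(n_steps, change_at, mean_a, mean_b, std=1.0, n_per_step=200, seed=0):
    """Abrupt (shift) CD: the mean jumps from mean_a to mean_b at step change_at."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)
    means = np.where(t < change_at, float(mean_a), float(mean_b))
    # One row of samples per time step, drawn around that step's mean.
    samples = rng.normal(means[:, None], std, size=(n_steps, n_per_step))
    return means, samples


def moving_drift(n_steps, start, end, mean_a, mean_b, std=1.0, n_per_step=200, seed=0):
    """Gradual (moving) CD: the mean slides linearly from mean_a to mean_b."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_steps)
    # Fraction of the transition completed at each step, clipped to [0, 1].
    frac = np.clip((t - start) / (end - start), 0.0, 1.0)
    means = mean_a + frac * (mean_b - mean_a)
    samples = rng.normal(means[:, None], std, size=(n_steps, n_per_step))
    return means, samples


# Mirrors the timing in Figure 2 with 0-indexed steps: a shift between the
# third and fourth steps, and a moving CD from the sixth to the tenth step.
shift_means, shift_samples = shift_drift(10, change_at=3, mean_a=0.0, mean_b=5.0)
moving_means, moving_samples = moving_drift(10, start=5, end=9, mean_a=0.0, mean_b=5.0)
```

Plotting the per-step mean and standard deviation of `shift_samples` and `moving_samples` against the step index gives the kind of mean $\pm$ standard deviation view described for Figure 2.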