Fitting Multiple Machine Learning Models with Performance Based Clustering

Mehmet Efe Lorasdagi, Ahmet Berker Koc, Ali Taha Koc, Suleyman Serdar Kozat

TL;DR

The paper tackles nonstationarity: real-world data often arise from multiple generating mechanisms, which degrades the performance of any single model. It proposes Performance Based Clustering (PBC), an EM-inspired framework that clusters data according to the relations between features and targets and learns a separate model per cluster. For online settings, it combines the learned models into an ensemble whose weights are updated by gradient descent as streaming data arrive. Across synthetic and real-world datasets, PBC yields significant improvements over traditional single-model approaches, and the authors provide open-source code to support reproducibility.

Abstract

Traditional machine learning approaches assume that data come from a single generating mechanism, an assumption that may not hold for most real-life data. In these cases, the single-mechanism assumption can result in suboptimal performance. We introduce a clustering framework that eliminates this assumption by grouping the data according to the relations between the features and the target values, and we obtain multiple separate models, each learning a different part of the data. We further extend our framework to applications with streaming data, where we produce outcomes using an ensemble of models whose weights are updated based on the incoming data batches. We demonstrate the performance of our approach on widely studied real-life datasets, showing significant improvements over traditional single-model approaches.
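To make the idea in the abstract concrete, the following is a minimal sketch, not the authors' algorithm. It assumes hard cluster assignments, linear least-squares models, and a squared-error loss: each sample is reassigned to the cluster whose model currently predicts it best, and each model is then refit on its own cluster. The function names (`fit_linear`, `predict_linear`, `performance_based_clustering`, `update_ensemble_weights`), the learning rate, and the toy data are all hypothetical choices made for illustration; the paper's EM-inspired procedure and its ensemble update rule may differ in their details.

```python
import numpy as np

def fit_linear(X, y):
    # Least-squares fit with a bias column; returns the weight vector.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    w, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return w

def predict_linear(w, X):
    # Predictions of a linear model with a bias term.
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])
    return Xb @ w

def performance_based_clustering(X, y, n_clusters=2, n_iters=20, seed=0):
    # Assumption: hard assignments and squared error stand in for the
    # paper's EM-inspired steps.
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=X.shape[0])
    models = [None] * n_clusters
    for _ in range(n_iters):
        # Fit step: one model per cluster, trained only on its own samples.
        for k in range(n_clusters):
            mask = labels == k
            if mask.sum() >= X.shape[1] + 1:
                models[k] = fit_linear(X[mask], y[mask])
            elif models[k] is None:
                # Too few samples and no earlier model: fall back to all data.
                models[k] = fit_linear(X, y)
        # Assignment step: move each sample to its best-performing model.
        errors = np.stack([(y - predict_linear(w, X)) ** 2 for w in models], axis=1)
        new_labels = errors.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return models, labels

def update_ensemble_weights(weights, models, X_batch, y_batch, lr=0.01):
    # One gradient-descent step on the batch mean squared error of the
    # weighted ensemble (assumed form of the online update).
    P = np.stack([predict_linear(w, X_batch) for w in models], axis=1)
    grad = 2.0 * P.T @ (P @ weights - y_batch) / len(y_batch)
    return weights - lr * grad

# Toy data drawn from two different linear generating mechanisms.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
mechanism = rng.integers(2, size=200)
y = np.where(mechanism == 0, 2.0 * X[:, 0] + 1.0, -3.0 * X[:, 0] + 4.0)

models, labels = performance_based_clustering(X, y, n_clusters=2)

# Compare a single global model with the per-cluster models.
single_mse = np.mean((y - predict_linear(fit_linear(X, y), X)) ** 2)
errors = np.stack([(y - predict_linear(w, X)) ** 2 for w in models], axis=1)
clustered_mse = errors[np.arange(len(y)), labels].mean()

# One online update of the ensemble weights on a batch of 32 samples.
weights = update_ensemble_weights(np.full(2, 0.5), models, X[:32], y[:32])
```

In this sketch the training error of the per-cluster models can be compared with that of a single global fit through `clustered_mse` and `single_mse`. At prediction time the cluster of a new sample is unknown, which is one reason to combine the models through learned ensemble weights rather than through the training labels.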

Paper Structure

This paper contains 9 sections, 15 equations, 2 figures, 2 tables, 1 algorithm.

Figures (2)

  • Figure 1: Misclassification rates for the $\mathcal{D}_k$'s over iterations, averaged over 25 simulations.
  • Figure 2: The ensemble weights of the clusters with respect to the sequentially arriving test batches for the M4 Weekly dataset.