Fitting Multiple Machine Learning Models with Performance Based Clustering
Mehmet Efe Lorasdagi, Ahmet Berker Koc, Ali Taha Koc, Suleyman Serdar Kozat
TL;DR
The paper addresses nonstationarity: real-world data often arise from multiple generating mechanisms, and fitting a single model to such data degrades performance. It proposes Performance Based Clustering (PBC), an EM-inspired framework that clusters data according to the relation between features and targets and learns a separate model for each cluster. For online settings, it combines the learned models into an ensemble whose weights are updated by gradient descent as streaming data arrive. Across synthetic and real-world datasets, the authors report significant improvements over traditional single-model approaches, and they provide open-source code to support reproducibility.
Abstract
Traditional machine learning approaches assume that data come from a single generating mechanism, an assumption that may not hold for most real-life data and that can result in suboptimal performance when it fails. We introduce a clustering framework that removes this assumption by grouping the data according to the relations between the features and the target values, and we train a separate model for each group so that each model learns a different part of the data. We further extend our framework to streaming-data applications, where we produce outcomes using an ensemble of the models and update the ensemble weights based on the incoming data batches. We demonstrate the performance of our approach on widely studied real-life datasets, showing significant improvements over traditional single-model approaches.
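The two ideas above, grouping samples by how well each model predicts them and updating ensemble weights per batch, can be illustrated with a minimal sketch. This is not the authors' implementation: the linear base models, the hard assignment by smallest squared error, the squared-error loss, the unconstrained ensemble weights, the learning rate, and all function names (`fit_linear`, `performance_based_clustering`, `update_ensemble_weights`) are assumptions made for illustration only.

```python
import numpy as np


def fit_linear(X, y):
    """Least-squares linear fit with a bias term (assumed base model)."""
    A = np.hstack([X, np.ones((len(X), 1))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    return np.hstack([X, np.ones((len(X), 1))]) @ coef


def performance_based_clustering(X, y, n_clusters=2, n_iters=50, seed=0):
    """Alternate between refitting one model per cluster and reassigning each
    sample to the model with the smallest squared error on that sample."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=len(X))
    # Every cluster starts from the single global fit.
    models = [fit_linear(X, y) for _ in range(n_clusters)]
    for _ in range(n_iters):
        for k in range(n_clusters):
            mask = labels == k
            if mask.any():  # an empty cluster keeps its previous model
                models[k] = fit_linear(X[mask], y[mask])
        errors = np.stack([(predict_linear(m, X) - y) ** 2 for m in models], axis=1)
        new_labels = errors.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return models, labels


def update_ensemble_weights(weights, preds, y, lr=0.05):
    """One gradient-descent step on the mean squared error of the weighted
    combination preds @ weights, computed on a single incoming batch."""
    residual = preds @ weights - y
    grad = 2.0 * (preds.T @ residual) / len(y)
    return weights - lr * grad


# Toy data from two mechanisms: y = 2x + 1 and y = -(2x + 1), plus small noise.
rng = np.random.default_rng(1)
X = rng.uniform(-1.0, 1.0, size=(200, 1))
sign = np.where(rng.random(200) < 0.5, 1.0, -1.0)
y = sign * (2.0 * X[:, 0] + 1.0) + 0.05 * rng.standard_normal(200)

models, labels = performance_based_clustering(X, y)
single_mse = np.mean((predict_linear(fit_linear(X, y), X) - y) ** 2)
errors = np.stack([(predict_linear(m, X) - y) ** 2 for m in models], axis=1)
pbc_mse = np.mean(errors[np.arange(len(y)), labels])
print(f"single-model MSE: {single_mse:.3f}, per-cluster MSE: {pbc_mse:.3f}")

# Streaming part: combine the learned models and update the weights per batch.
weights = np.full(len(models), 1.0 / len(models))
for start in range(0, len(X), 20):
    Xb, yb = X[start:start + 20], y[start:start + 20]
    preds = np.stack([predict_linear(m, Xb) for m in models], axis=1)
    weights = update_ensemble_weights(weights, preds, yb)
```

Because each refit and each reassignment can only lower the total error of samples under their assigned model, the per-cluster error in this sketch never exceeds that of the single global fit. The streaming loop only demonstrates the mechanics of the weight update on the toy data; it does not reproduce any result from the paper.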
