SUDS: A Strategy for Unsupervised Drift Sampling

Christofer Fellicious, Lorenz Wendlinger, Mario Gancarski, Jelena Mitrovic, Michael Granitzer

TL;DR

We present the Strategy for Drift Sampling (SUDS), a novel method that selects homogeneous samples for retraining using existing drift detection algorithms, thereby enhancing model adaptability to evolving data. We also introduce the Harmonized Annotated Data Accuracy Metric (HADAM), which evaluates classifier performance in relation to the quantity of annotated data required to achieve that performance.

Abstract

Supervised machine learning often encounters concept drift, where the data distribution changes over time, degrading model performance. Existing drift detection methods focus on identifying these shifts but often overlook the challenge of acquiring labeled data for model retraining after a shift occurs. We present the Strategy for Drift Sampling (SUDS), a novel method that selects homogeneous samples for retraining using existing drift detection algorithms, thereby enhancing model adaptability to evolving data. SUDS seamlessly integrates with current drift detection techniques. We also introduce the Harmonized Annotated Data Accuracy Metric (HADAM), a metric that evaluates classifier performance in relation to the quantity of annotated data required to achieve the stated performance, thereby taking into account the difficulty of acquiring labeled data. Our contributions are twofold: SUDS combines drift detection with strategic sampling to improve the retraining process, and HADAM provides a metric that balances classifier performance with the amount of labeled data, ensuring efficient resource utilization. Empirical results demonstrate the efficacy of SUDS in optimizing labeled data use in dynamic environments, significantly improving the performance of machine learning applications in real-world scenarios. Our code is open source and available at https://github.com/cfellicious/SUDS/

Paper Structure

This paper contains 9 sections, 6 equations, 2 figures, 5 tables, 2 algorithms.

Figures (2)

  • Figure 1: Average difference in HADAM for real-world datasets between OCDD with and without SUDS for different hyperparameter combinations. We see that our SUDS modifications perform better overall as there are no negative values. A negative value indicates that the original algorithm was better.
  • Figure 2: Difference in HADAM between D3(SUDS) and D3, across all datasets and across only the real-world datasets. A positive value means that, on average, the SUDS modification is better than the original algorithm. D3(SUDS) performs better on the real-world datasets, where all values are positive.