Table of Contents
Fetching ...

A Scalable Approach to Covariate and Concept Drift Management via Adaptive Data Segmentation

Vennela Yarabolu, Govind Waghmare, Sonia Gupta, Siddhartha Asthana

TL;DR

This paper introduces an advanced framework that integrates the strengths of data-centric approaches with adaptive management of both covariate and concept drift in a scalable and efficient manner, providing a robust, adaptable, and scalable approach to maintaining high-performance ML systems across various applications.

Abstract

In many real-world applications, continuous machine learning (ML) systems are crucial but prone to data drift, a phenomenon where discrepancies between historical training data and future test data lead to significant performance degradation and operational inefficiencies. Traditional drift adaptation methods typically update models using ensemble techniques, often discarding drifted historical data, and focus primarily on either covariate drift or concept drift. These methods face issues such as high resource demands, inability to manage all types of drifts effectively, and neglecting the valuable context that historical data can provide. We contend that explicitly incorporating drifted data into the model training process significantly enhances model accuracy and robustness. This paper introduces an advanced framework that integrates the strengths of data-centric approaches with adaptive management of both covariate and concept drift in a scalable and efficient manner. Our framework employs sophisticated data segmentation techniques to identify optimal data batches that accurately reflect test data patterns. These data batches are then utilized for training on test data, ensuring that the models remain relevant and accurate over time. By leveraging the advantages of both data segmentation and scalable drift management, our solution ensures robust model accuracy and operational efficiency in large-scale ML deployments. It also minimizes resource consumption and computational overhead by selecting and utilizing relevant data subsets, leading to significant cost savings. Experimental results on classification task on real-world and synthetic datasets show our approach improves model accuracy while reducing operational costs and latency. This practical solution overcomes inefficiencies in current methods, providing a robust, adaptable, and scalable approach.

A Scalable Approach to Covariate and Concept Drift Management via Adaptive Data Segmentation

TL;DR

This paper introduces an advanced framework that integrates the strengths of data-centric approaches with adaptive management of both covariate and concept drift in a scalable and efficient manner, providing a robust, adaptable, and scalable approach to maintaining high-performance ML systems across various applications.

Abstract

In many real-world applications, continuous machine learning (ML) systems are crucial but prone to data drift, a phenomenon where discrepancies between historical training data and future test data lead to significant performance degradation and operational inefficiencies. Traditional drift adaptation methods typically update models using ensemble techniques, often discarding drifted historical data, and focus primarily on either covariate drift or concept drift. These methods face issues such as high resource demands, inability to manage all types of drifts effectively, and neglecting the valuable context that historical data can provide. We contend that explicitly incorporating drifted data into the model training process significantly enhances model accuracy and robustness. This paper introduces an advanced framework that integrates the strengths of data-centric approaches with adaptive management of both covariate and concept drift in a scalable and efficient manner. Our framework employs sophisticated data segmentation techniques to identify optimal data batches that accurately reflect test data patterns. These data batches are then utilized for training on test data, ensuring that the models remain relevant and accurate over time. By leveraging the advantages of both data segmentation and scalable drift management, our solution ensures robust model accuracy and operational efficiency in large-scale ML deployments. It also minimizes resource consumption and computational overhead by selecting and utilizing relevant data subsets, leading to significant cost savings. Experimental results on classification task on real-world and synthetic datasets show our approach improves model accuracy while reducing operational costs and latency. This practical solution overcomes inefficiencies in current methods, providing a robust, adaptable, and scalable approach.

Paper Structure

This paper contains 18 sections, 5 equations, 4 figures, 4 tables, 2 algorithms.

Figures (4)

  • Figure 1: Covariate shift ranking of training batches 1,2,3,4 and 5 with respect to test point shown in yellow. In the leaf node, batches are ranked as 3,4,1,5 and 2. Here, batch 3 has lowest covariate shift.
  • Figure 2: Subsegment selection approach using gradient based disparity and gain scores
  • Figure 3: Complete architecture of our method. The top branch focuses on concept drift while bottom branch ranks train batches based on covariate shift.
  • Figure 4: Trade-off between $\%$ of data used vs accuracy.