Cost-Effective Retraining of Machine Learning Models

Ananth Mahadevan, Michael Mathioudakis

TL;DR

This paper addresses the costly problem of retraining ML models under data drift by introducing the Cost-Aware Retraining Algorithm (Cara), which optimizes retrain-or-keep decisions using both model staleness and retraining costs. It defines a staleness cost and a retraining cost, builds a cost matrix from them, and presents three Cara variants (threshold, cumulative threshold, periodic) plus an Oracle baseline solved via dynamic programming. Empirical results on synthetic and real-world datasets show that Cara achieves near-Oracle strategy costs and competitive query accuracy while making fewer retraining decisions than drift-detection baselines. The work demonstrates practical, cost-aware mechanisms for online maintenance of deployed models in streaming contexts and highlights future avenues for scalability and learning-based decision policies.
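
To make the three variants named above concrete, the following is a minimal sketch, not the paper's implementation. It assumes a hypothetical cost-matrix layout in which `C[i][t]` (for `i < t`) is the staleness cost of answering batch `t` with the model trained at batch `i`, and `C[t][t]` is the cost of retraining at batch `t`. It also assumes the staleness cost of the current batch is available when the decision is made, so it is an offline replay rather than a true online policy. All function names and the example numbers are illustrative.

```python
from typing import Callable, List, Sequence, Tuple

Matrix = Sequence[Sequence[float]]
Rule = Callable[[Matrix, int, int], bool]  # (C, last_trained, t) -> retrain?


def run_policy(C: Matrix, should_retrain: Rule) -> Tuple[List[str], float]:
    """Replay a retrain-or-keep rule over batches 1..T-1 and sum its cost."""
    last = 0  # batch at which the current model was trained (assumed: batch 0)
    decisions: List[str] = []
    total = 0.0
    for t in range(1, len(C)):
        if should_retrain(C, last, t):
            total += C[t][t]      # pay the retraining cost (assumed layout)
            last = t
            decisions.append("retrain")
        else:
            total += C[last][t]   # pay the staleness cost of the kept model
            decisions.append("keep")
    return decisions, total


def threshold_rule(tau: float) -> Rule:
    """Retrain when the staleness cost of the current batch exceeds tau."""
    return lambda C, last, t: C[last][t] > tau


def cumulative_rule(tau: float) -> Rule:
    """Retrain when staleness accumulated since the last retrain exceeds tau."""
    return lambda C, last, t: sum(C[last][s] for s in range(last + 1, t + 1)) > tau


def periodic_rule(period: int) -> Rule:
    """Retrain every `period` batches, regardless of observed costs."""
    return lambda C, last, t: t - last >= period


# Made-up 4-batch example; values are illustrative only.
C = [
    [0, 1, 2, 6],
    [0, 3, 1, 2],
    [0, 0, 3, 1],
    [0, 0, 0, 3],
]
print(run_policy(C, threshold_rule(1.5)))
print(run_policy(C, cumulative_rule(5.0)))
print(run_policy(C, periodic_rule(2)))
```

Expressing each variant as a small predicate keeps the replay loop shared, so the strategies differ only in when they trigger a retrain; the parameters `tau` and `period` are placeholders for whatever values a deployment would tune.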

Abstract

It is important to retrain a machine learning (ML) model in order to maintain its performance as the data changes over time. However, this can be costly as it usually requires processing the entire dataset again. This creates a trade-off between retraining too frequently, which leads to unnecessary computing costs, and not retraining often enough, which results in stale and inaccurate ML models. To address this challenge, we propose ML systems that make automated and cost-effective decisions about when to retrain an ML model. We aim to optimize the trade-off by considering the costs associated with each decision. Our research focuses on determining whether to retrain or keep an existing ML model based on various factors, including the data, the model, and the predictive queries answered by the model. Our main contribution is a Cost-Aware Retraining Algorithm called Cara, which optimizes the trade-off over streams of data and queries. To evaluate the performance of Cara, we analyzed synthetic datasets and demonstrated that Cara can adapt to different data drifts and retraining costs while performing similarly to an optimal retrospective algorithm. We also conducted experiments with real-world datasets and showed that Cara achieves better accuracy than drift detection baselines while making fewer retraining decisions, ultimately resulting in lower total costs.
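
The "optimal retrospective algorithm" mentioned above can be illustrated with a short dynamic program. This is a sketch under assumptions rather than the authors' code: it uses a hypothetical cost-matrix layout where `C[i][t]` (for `i < t`) is the staleness cost of serving batch `t` with the model trained at batch `i`, and `C[t][t]` is the cost of retraining at batch `t`; the initial model is assumed to be trained at batch 0 at no charge. The function name and the example values are made up for illustration.

```python
from typing import List, Sequence, Tuple


def oracle_strategy(C: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Return (retrain batches, minimum total cost) with full hindsight.

    best[r] is the cheapest cost of handling batches 1..r when a retrain
    happens at batch r. Index T is a sentinel meaning "end of stream",
    which carries no retraining cost.
    """
    T = len(C)
    best = [float("inf")] * (T + 1)
    prev = [-1] * (T + 1)
    best[0] = 0.0  # assumed: the initial model at batch 0 is already paid for
    for p in range(T):
        keep_cost = 0.0  # staleness of keeping model p over batches p+1..r-1
        for r in range(p + 1, T + 1):
            retrain_cost = C[r][r] if r < T else 0.0
            candidate = best[p] + keep_cost + retrain_cost
            if candidate < best[r]:
                best[r], prev[r] = candidate, p
            if r < T:
                keep_cost += C[p][r]
    # Walk back from the sentinel to recover the retrain batches.
    retrains: List[int] = []
    r = prev[T]
    while r > 0:
        retrains.append(r)
        r = prev[r]
    return sorted(retrains), best[T]


# Made-up 4-batch example; values are illustrative only.
C = [
    [0, 1, 2, 6],
    [0, 3, 1, 2],
    [0, 0, 3, 1],
    [0, 0, 0, 3],
]
print(oracle_strategy(C))
```

The double loop runs in time quadratic in the number of batches because the staleness sum for a kept model is accumulated incrementally instead of being recomputed for every candidate pair.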

Paper Structure (36 sections, 16 equations, 11 figures, 7 tables, 3 algorithms)

Figures (11)

  • Figure 1: First scenario. (a) Initial data $D_0$ and model $M_0$. Concept drift occurs at $t=1$. (b) Queries are far from misclassifications. (c) Queries are close to misclassifications.
  • Figure 2: Second scenario. (a)-(c) The data shows no concept or covariate drift in batches $t=1$ through $t=3$. The queries show covariate shift, moving from far from the decision boundary in (a) to closer to the misclassifications in (c).
  • Figure 3: (a) Varying the query distribution. The red lines indicate the offline optimal Cara-T strategy (markers and solid lines correspond to Retrain and Keep decisions, respectively). (b) Varying the retraining cost.
  • Figure 4: Strategy cost, number of retrains and query accuracy as a function of retraining cost for the CovCon-D dataset.
  • Figure 5: Strategy cost, number of retrains and query accuracy as a function of retraining cost for the real-world datasets.
  • ...and 6 more figures