Step-by-Step Data Cleaning Recommendations to Improve ML Prediction Accuracy

Sedir Mohammed, Felix Naumann, Hazar Harmouch

TL;DR

This work addresses data cleaning under budget constraints, with the goal of improving ML prediction accuracy. It introduces COMET, a stepwise, cost-aware framework that recommends which feature to clean next. COMET learns the impact of cleaning actions by incrementally polluting the data and fitting a Bayesian regression model that predicts per-feature gains and their uncertainties; it then ranks features using a cost-adjusted score. Across seven classification datasets, four ML algorithms, and four error types, COMET achieves up to $52$ percentage points of $F1$ score improvement over baselines, and about $5$ points on average. The approach shows the practical value of tying data cleaning to the downstream ML objective in data-centric AI, and it leaves room for extension to other tasks and error types.
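The ranking idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the linear model of $F1$ against pollution level, the prior and noise precisions, the `clean_step` extrapolation, and the plain gain-per-cost score are all assumptions made for illustration. The summary does not say how the predicted uncertainty enters the score, so the sketch only reports it alongside the gain.

```python
import numpy as np

def bayes_linreg(x, y, alpha=1e-2, beta=100.0):
    """Posterior mean and covariance of weights for y ~ w0 + w1 * x.

    alpha: assumed prior precision on the weights.
    beta:  assumed precision of the observation noise.
    """
    X = np.column_stack([np.ones_like(x), x])
    S = np.linalg.inv(alpha * np.eye(2) + beta * X.T @ X)
    m = beta * S @ X.T @ y
    return m, S

def predict(m, S, x_new, beta=100.0):
    """Predictive mean and standard deviation at a single input x_new."""
    phi = np.array([1.0, x_new])
    mean = phi @ m
    std = np.sqrt(1.0 / beta + phi @ S @ phi)
    return mean, std

def rank_features(observations, costs, clean_step=0.05):
    """Rank features by predicted F1 gain per unit cleaning cost.

    observations: feature -> (added pollution levels, F1 observed at each level)
    costs:        feature -> cost of one cleaning step (hypothetical units)
    """
    scored = []
    for feature, (levels, f1) in observations.items():
        levels = np.asarray(levels, dtype=float)
        f1 = np.asarray(f1, dtype=float)
        m, S = bayes_linreg(levels, f1)
        base, _ = predict(m, S, 0.0)
        # Extrapolate one step in the "cleaner" direction (negative pollution).
        cleaned, std = predict(m, S, -clean_step)
        gain = cleaned - base
        scored.append((feature, gain / costs[feature], gain, std))
    # Highest cost-adjusted score first.
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Made-up measurements: F1 drops fast when "age" is polluted, slowly for "zip".
observations = {
    "age": ([0.0, 0.1, 0.2, 0.3], [0.90, 0.80, 0.70, 0.60]),
    "zip": ([0.0, 0.1, 0.2, 0.3], [0.90, 0.89, 0.88, 0.87]),
}
for feature, score, gain, std in rank_features(observations, {"age": 1.0, "zip": 1.0}):
    print(f"{feature}: score={score:.4f} gain={gain:.4f} std={std:.4f}")
```

With equal costs the feature whose pollution hurts most is recommended first; raising the cost of cleaning that feature can flip the order, which is the budget-aware behavior the summary describes.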

Abstract

Data quality is crucial in machine learning (ML) applications, as errors in the data can significantly impact the prediction accuracy of the underlying ML model. Therefore, data cleaning is an integral component of any ML pipeline. However, in practical scenarios, data cleaning incurs significant costs, as it often involves domain experts for configuring and executing the cleaning process. Thus, efficient resource allocation during data cleaning can enhance ML prediction accuracy while controlling expenses. This paper presents COMET, a system designed to optimize data cleaning efforts for ML tasks. COMET gives step-by-step recommendations on which feature to clean next, maximizing the efficiency of data cleaning under resource constraints. We evaluated COMET across various datasets, ML algorithms, and data error types, demonstrating its robustness and adaptability. Our results show that COMET consistently outperforms a feature importance-based method, a random method, and another well-known cleaning method, achieving up to 52 and on average 5 percentage points higher ML prediction accuracy than these baselines.

Paper Structure

This paper contains 26 sections, 5 equations, 27 figures, 1 table.

Figures (27)

  • Figure 1: COMET incrementally pollutes features and, based on the observed negative effect of the pollution on prediction accuracy, estimates the positive impact of data cleaning.
  • Figure 2: COMET workflow for an individual error type: (1) Polluter: introduces further pollution; (2) Estimator: evaluates pollution/cleaning effects on ML model accuracy; (3) Recommender: proposes feature-wise cleaning strategies based on scoring.
  • Figure 3: Comparison of COMET with the baselines for SVM across multiple error types and cost functions.
  • Figure 4: Comparison of COMET with AC for LIR across multiple error types and cost functions.
  • Figure 5: Comparison of COMET with the baselines FIR, RR, and CL for MLP across error types.
  • ...and 22 more figures