Data Cleaning and Machine Learning: A Systematic Literature Review
Pierre-Olivier Côté, Amin Nikanjam, Nafisa Ahmed, Dmytro Humeniuk, Foutse Khomh
TL;DR
This systematic literature review maps the intersection of data cleaning and machine learning (DC&ML) by surveying 101 papers from 2016–2022 and classifying them into six data-cleaning activities: feature cleaning, label cleaning, entity matching, outlier detection, imputation, and holistic data cleaning. It contrasts data cleaning for ML with ML for data cleaning, highlights transformer-based and related approaches as leading techniques, and details evaluation metrics, datasets, and baselines. The study offers 24 future directions, including data augmentation, public data-cleaning datasets, tooling, LLMs, holistic cleaning, and interactive cleaning, to accelerate progress in DC&ML. By providing a replication package and a rigorous taxonomy, the work aims to guide researchers and practitioners toward more effective and practical data-cleaning solutions that improve ML performance and data quality at scale.
Abstract
Context: Machine Learning (ML) is integrated into a growing number of systems for various applications. Because the performance of an ML model is highly dependent on the quality of the data it has been trained on, there is a growing interest in approaches to detect and repair data errors (i.e., data cleaning). Researchers are also exploring how ML can be used for data cleaning, thereby creating a dual relationship between ML and data cleaning. To the best of our knowledge, no study comprehensively reviews this relationship. Objective: This paper's objectives are twofold. First, it aims to summarize the latest approaches for data cleaning for ML and ML for data cleaning. Second, it provides future work recommendations. Method: We conduct a systematic literature review of the papers published between 2016 and 2022 inclusive. We identify six types of data cleaning activities performed with and for ML: feature cleaning, label cleaning, entity matching, outlier detection, imputation, and holistic data cleaning. Results: We summarize the content of 101 papers covering various data cleaning activities and provide 24 future work recommendations. Our review highlights many promising data cleaning techniques that can be further extended. Conclusion: We believe that our review of the literature will help the community develop better approaches to clean data.
