A Survey on Data Quality Dimensions and Tools for Machine Learning

Yuhan Zhou, Fengjiao Tu, Kewei Sha, Junhua Ding, Haihua Chen

TL;DR

The paper addresses the challenge of data quality in machine learning by organizing a framework around four DQ dimensions and twelve ML-oriented metrics, and by surveying 17 open-source data quality tools developed in the last five years. It provides a comparative analysis of these tools across core functions such as profiling, measurement, transformation, and monitoring, and proposes a roadmap for building accessible, AI-assisted data quality tools tailored to ML pipelines. The contributions include a consolidated DQ taxonomy, an evaluation of current tools, and practical guidance for tool development that integrates AI trends such as large language models and low-code interfaces. The work aims to advance data-centric AI by clarifying how data quality evaluation and improvement can be standardized, automated, and extended to support robust, fair, and scalable ML systems. Overall, it serves as a reference for researchers and practitioners looking to design, adopt, or extend open-source DQ tooling in ML workflows.

Abstract

Machine learning (ML) technologies have become integral to practically all aspects of our society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex data in data-centric AI, traditional methods like exploratory data analysis (EDA) and cross-validation (CV) face challenges, highlighting the importance of mastering DQ tools. In this survey, we review 17 DQ evaluation and improvement tools developed in the last 5 years. By introducing the DQ dimensions, metrics, and main functions embedded in these tools, we compare their strengths and limitations and propose a roadmap for developing open-source DQ tools for ML. Based on a discussion of the challenges and emerging trends, we further highlight the potential applications of large language models (LLMs) and generative AI in DQ evaluation and improvement for ML. We believe this comprehensive survey can enhance understanding of DQ in ML and could drive progress in data-centric AI. A complete list of the literature investigated in this survey is available on GitHub at: https://github.com/haihua0913/awesome-dq4ml.

Paper Structure (14 sections, 3 figures, 3 tables)

Figures (3)

  • Figure 1: Evolution of DQ evaluation/improvement tools across functions over time. The 6 core functions are data loading, data profiling, data integration, data transformation, automation and monitoring, and output and reports. Every tool supports the loading and output functions, so only the middle four are discussed. The length of each tool shows its coverage of the functions, and the color indicates the year in which the tool was last updated.
  • Figure 2: DQ dimensions, metrics, and corresponding tools. The first and second rows show the 4 dimensions and 12 DQ metrics. Beneath each one, the corresponding tools are listed, indicating which metrics and dimensions each tool evaluates. The color of each tool represents the year in which it was last updated, as shown in the legend at the bottom center. Business rules are listed on the left as an additional aspect, since many tools support customized rules.
  • Figure 3: Workflow of data quality evaluation and improvement. The orange elements represent the steps of ML model construction, the blue ones are the functions of DQ tools, and the metrics are shown in green boxes below each step. Specific ML tasks set certain DQ requirements, leading to dataset collection and subsequent evaluation and improvement. Finally, model performance reflects the effectiveness of the DQ improvement process.
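
The evaluate-then-improve loop described for Figure 3 can be illustrated with a minimal Python sketch. It is not taken from the paper or from any of the surveyed tools. The two metrics used here (completeness and duplicate rate), the thresholds, and all function names are assumptions chosen for illustration; the page does not list the paper's twelve metrics.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# A row is a plain dict; None marks a missing value (an assumption of this sketch).
Row = Dict[str, Optional[object]]


def completeness(rows: Sequence[Row], columns: Sequence[str]) -> float:
    """Fraction of cells that are not missing (illustrative metric)."""
    total = len(rows) * len(columns)
    if total == 0:
        return 1.0
    filled = sum(1 for row in rows for col in columns if row.get(col) is not None)
    return filled / total


def duplicate_rate(rows: Sequence[Row], columns: Sequence[str]) -> float:
    """Fraction of rows that repeat an earlier row (illustrative metric)."""
    if not rows:
        return 0.0
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(rows)


def evaluate(
    rows: Sequence[Row],
    columns: Sequence[str],
    min_completeness: float,
    max_duplicate_rate: float,
) -> Dict[str, Tuple[float, bool]]:
    """Measure both metrics and check them against task-specific requirements."""
    c = completeness(rows, columns)
    d = duplicate_rate(rows, columns)
    return {
        "completeness": (c, c >= min_completeness),
        "duplicate_rate": (d, d <= max_duplicate_rate),
    }


def improve(rows: Sequence[Row], columns: Sequence[str]) -> List[Row]:
    """Naive improvement step: drop rows with missing values and exact duplicates."""
    cleaned: List[Row] = []
    seen = set()
    for row in rows:
        key = tuple(row.get(col) for col in columns)
        if any(value is None for value in key) or key in seen:
            continue
        seen.add(key)
        cleaned.append(row)
    return cleaned


if __name__ == "__main__":
    cols = ["x", "y"]
    data: List[Row] = [
        {"x": 1, "y": "a"},
        {"x": 1, "y": "a"},     # exact duplicate
        {"x": None, "y": "b"},  # missing value
        {"x": 2, "y": "c"},
    ]
    print("before:", evaluate(data, cols, 0.95, 0.0))
    print("after: ", evaluate(improve(data, cols), cols, 0.95, 0.0))
```

Dropping rows is only one possible improvement strategy; imputation or relabeling may be preferable when data is scarce. In a real pipeline, the final check would be the downstream model's performance, as the Figure 3 caption notes.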