Data Readiness for AI: A 360-Degree Survey

Kaveen Hiniduma, Suren Byna, Jean Luca Bez

TL;DR

This article addresses the need for standardized metrics to assess data readiness for AI (DRAI) by surveying over 140 sources across traditional data quality and AI-specific concerns. It introduces a six-pillar taxonomy that organizes metrics for both structured and unstructured data, including completeness, outliers, mislabeled data, privacy leakage, fairness, and FAIR compliance, and extends these concepts to textual and multimedia data. The authors discuss existing frameworks (e.g., DQT, AIDRIN, FAIR-related tools) and propose a comprehensive DRAI metric framework to guide data preparation for AI tasks, emphasizing both generic data-quality dimensions and AI-specific impacts like discrimination and bias. They also highlight gaps such as scalability, life-cycle assessment, and domain-specific adaptations, and argue for ongoing development of benchmarks and interpretable visualizations to support practitioners. Overall, the paper provides a foundational resource and taxonomy to standardize DRAI metrics, enabling improved data quality, fairness, and reliability in AI training and inference.

Abstract

Artificial Intelligence (AI) applications critically depend on data. Poor-quality data produces inaccurate and ineffective AI models that may lead to incorrect or unsafe use. Evaluating data readiness is therefore a crucial step in improving the quality and appropriateness of data used for AI. R&D effort has gone into improving data quality; however, standardized metrics for evaluating data readiness for AI training are still evolving. In this study, we perform a comprehensive survey of metrics used to verify data readiness for AI training. The survey examines more than 140 papers drawn from the ACM Digital Library, IEEE Xplore, and journals such as Nature, Springer, and Science Direct, as well as online articles published by prominent AI experts. From this survey, we propose a taxonomy of data readiness for AI (DRAI) metrics for structured and unstructured datasets. We anticipate that this taxonomy will lead to new standards for DRAI metrics that will be used to enhance the quality, accuracy, and fairness of AI training and inference.

Paper Structure (36 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Papers chosen for this survey from different time frames.
  • Figure 2: 360° View of Mapping Data Readiness Dimensions for AI.
  • Figure 3: A high-level view of data readiness metric categories for AI.