Table of Contents
Fetching ...

How to Define the Quality of Data? A Feature-Based Literature Survey

Markus Matoni, Arno Kesper, Gabriele Taentzer

TL;DR

This work addresses the lack of a unified definition for data quality by conducting a large-scale meta-study that identifies publications defining DQ through dimensions. It employs Feature-Oriented Domain Analysis (FODA) to build a comprehensive taxonomy that classifies DQ definitions across data types, representations, domains, and contextual relationships. The authors analyze 35 publications (out of over 17,000) to reveal patterns, gaps, and the need for standardization of DQ dimensions, domain-specific quasi-standards, and societal context integration. They propose that a declarative core for DQ, supplemented by examples, metrics, guidelines, and principled standardization, will enable more consistent DQ assessment and better data-intensive systems. The findings underscore the importance of harmonizing definitions across domains and recognizing evolving data types, while suggesting concrete directions for future DQ frameworks and evaluation methods.

Abstract

The digital transformation of our society is a constant challenge, as data is generated in almost every digital interaction. To use data effectively, it must be of high quality. This raises the question: what exactly is data quality? A systematic literature review of the existing literature shows that data quality is a multifaceted concept, characterized by a number of quality dimensions. However, the definitions of data quality vary widely. We used feature-oriented domain analysis to specify a taxonomy of data quality definitions and to classify the existing definitions. This allows us to identify research gaps and future topics.

How to Define the Quality of Data? A Feature-Based Literature Survey

TL;DR

This work addresses the lack of a unified definition for data quality by conducting a large-scale meta-study that identifies publications defining DQ through dimensions. It employs Feature-Oriented Domain Analysis (FODA) to build a comprehensive taxonomy that classifies DQ definitions across data types, representations, domains, and contextual relationships. The authors analyze 35 publications (out of over 17,000) to reveal patterns, gaps, and the need for standardization of DQ dimensions, domain-specific quasi-standards, and societal context integration. They propose that a declarative core for DQ, supplemented by examples, metrics, guidelines, and principled standardization, will enable more consistent DQ assessment and better data-intensive systems. The findings underscore the importance of harmonizing definitions across domains and recognizing evolving data types, while suggesting concrete directions for future DQ frameworks and evaluation methods.

Abstract

The digital transformation of our society is a constant challenge, as data is generated in almost every digital interaction. To use data effectively, it must be of high quality. This raises the question: what exactly is data quality? A systematic literature review of the existing literature shows that data quality is a multifaceted concept, characterized by a number of quality dimensions. However, the definitions of data quality vary widely. We used feature-oriented domain analysis to specify a taxonomy of data quality definitions and to classify the existing definitions. This allows us to identify research gaps and future topics.

Paper Structure

This paper contains 52 sections, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Systematic literature review methodology: workflow. The #Publications refers to the number of search results after applying the search string.
  • Figure 2: FODA: Feature Model: Legend
  • Figure 3: FODA: Feature Model: Root feature (legend: cf. \ref{['fig:foda_legend']})
  • Figure 4: FODA: Feature Model: data feature (legend: cf. \ref{['fig:foda_legend']})
  • Figure 5: FODA: Feature Model: dataset type feature (legend: cf. \ref{['fig:foda_legend']})
  • ...and 4 more figures