A survey of open-source data quality tools: shedding light on the materialization of data quality dimensions in practice
Vasileios Papastergios, Anastasios Gounaris
TL;DR
Data quality is multi-dimensional, yet practitioners face inconsistent terminology and unclear mappings between tool functionalities and the ISO/IEC 25012 standard. The authors survey six open-source DQ tools (Deequ, dbt Core, MobyDQ, Great Expectations, Soda Core, and Apache Griffin), extract low-level checks from code and documentation, and map them to the standard's dimensions, revealing many-to-many relationships and practical engineering considerations. They find that six inherent ISO/IEC 25012 dimensions are evidenced by built-in checks across the tools, and they accompany this finding with detailed engineering insights and an appendix of function names. The study advocates standardizing terminology and extending coverage to streaming data, aiming to connect concrete checks with theoretical DQ concepts in order to improve tool selection and metric design.
Abstract
Data Quality (DQ) describes the degree to which data characteristics meet requirements and are fit for use by humans and/or systems. DQ can be measured along several aspects, called DQ dimensions (e.g., accuracy, completeness, consistency), also referred to as characteristics in the literature. The ISO/IEC 25012 standard defines a data quality model with fifteen such dimensions, setting the requirements a data product should meet. In this short report, we aim to systematically bridge the gap between the lower-level functionalities offered by DQ tools and the higher-level dimensions, revealing the many-to-many relationships between them. To this end, we examine six open-source DQ tools and focus on providing a mapping between the functionalities they offer and the DQ dimensions defined by the ISO standard. Wherever applicable, we also provide insights into the software engineering details that the tools leverage to address DQ challenges.
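To make the notion of a many-to-many relationship between tool functionalities and DQ dimensions concrete, the following minimal Python sketch may help. It is illustrative only: the check names (`not_null`, `unique`, `value_in_range`, `freshness`) are hypothetical placeholders rather than functions of any surveyed tool, and the assignments to dimensions are assumptions made for the example, not the mapping reported in this work.

```python
from collections import defaultdict

# Hypothetical check names with an illustrative (assumed) assignment to
# ISO/IEC 25012 dimension names. This is NOT the mapping from the report.
CHECK_TO_DIMENSIONS = {
    "not_null": {"completeness"},
    "unique": {"consistency", "accuracy"},
    "value_in_range": {"accuracy", "consistency"},
    "freshness": {"currentness"},
}


def invert(mapping):
    """Build the reverse view: dimension -> set of checks that evidence it."""
    inverted = defaultdict(set)
    for check, dimensions in mapping.items():
        for dimension in dimensions:
            inverted[dimension].add(check)
    return dict(inverted)


def is_many_to_many(mapping):
    """True if some check maps to several dimensions AND
    some dimension is evidenced by several checks."""
    inverted = invert(mapping)
    one_check_many_dims = any(len(dims) > 1 for dims in mapping.values())
    one_dim_many_checks = any(len(checks) > 1 for checks in inverted.values())
    return one_check_many_dims and one_dim_many_checks


DIMENSION_TO_CHECKS = invert(CHECK_TO_DIMENSIONS)
```

Under these assumptions, a single check such as `unique` contributes to more than one dimension, while a single dimension such as accuracy is evidenced by more than one check. This two-way lookup is the kind of relationship the mapping is meant to expose.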
