Automated Data Quality Validation in an End-to-End GNN Framework

Sijie Dong, Soror Sahri, Themis Palpanas, Qitong Wang

TL;DR

The paper addresses data quality validation in ML pipelines by introducing DQuaG, an end-to-end GNN-based framework with a dual-decoder design for validation and repair that operates without expert-defined constraints. It learns normal feature dependencies from a clean data baseline using a GAT+GIN encoder to construct a feature graph, and it employs reconstruction-based criteria to detect anomalies and propose repairs. The training objective is $L_{\text{Total}} = \alpha L_{\text{validation}} + \beta L_{\text{repair}}$. The approach is evaluated on both synthetic and real-world datasets, where it shows higher accuracy and repair effectiveness than state-of-the-art constraint-based baselines and scales to large datasets. The work advances practical data quality management by reducing reliance on manual constraint tuning and enabling automatic repair, which can improve downstream ML performance and reliability. It also outlines future directions in robustness, interpretability, and post-validation tasks.
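As a rough illustration of the weighted objective stated above, the sketch below combines two per-task losses into one total. Only the weighted sum comes from the summary; the mean-squared-error form of each term, the function names (`mse`, `total_loss`), and the default weights are assumptions made for illustration.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(
    validation_out: Sequence[float],
    repair_out: Sequence[float],
    clean_row: Sequence[float],
    alpha: float = 1.0,  # assumed default weight for the validation term
    beta: float = 1.0,   # assumed default weight for the repair term
) -> float:
    """Weighted sum: L_Total = alpha * L_validation + beta * L_repair."""
    # Assumption: both decoders are scored against the clean row with MSE.
    l_validation = mse(validation_out, clean_row)
    l_repair = mse(repair_out, clean_row)
    return alpha * l_validation + beta * l_repair
```

Raising `alpha` relative to `beta` makes the validation decoder's error dominate training, and the reverse favors the repair decoder; the summary does not say which values the authors chose.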

Abstract

Ensuring data quality is crucial in modern data ecosystems, especially for training or testing datasets in machine learning. Existing validation approaches rely on computing data quality metrics and/or using expert-defined constraints. Although there are automated constraint generation methods, they are often incomplete and may be too strict or too soft, causing false positives or missed errors, thus requiring expert adjustment. These methods may also fail to detect subtle data inconsistencies hidden by complex interdependencies within the data. In this paper, we propose DQuaG, an end-to-end data quality validation and repair framework based on an improved Graph Neural Network (GNN) and multi-task learning. The proposed method incorporates a dual-decoder design: one decoder for data quality validation and the other for data repair. Our approach captures complex feature relationships within tabular datasets using a multi-layer GNN architecture to automatically detect explicit and hidden data errors. Unlike previous methods, our model does not require manual input for constraint generation and learns the underlying feature dependencies, enabling it to identify complex hidden errors that traditional systems often miss. Moreover, it can recommend repair values, improving overall data quality. Experimental results validate the effectiveness of our approach in identifying and resolving data quality issues. The paper appeared in EDBT 2025.
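The reconstruction-based validation idea can be sketched as follows: errors measured on clean data set a threshold, and unseen rows whose error exceeds it are flagged. This is a minimal sketch under stated assumptions, not the authors' procedure. The quantile-based threshold, the per-row mean squared error, and all function names are hypothetical, and the reconstructed rows are assumed to come from an already trained model.

```python
import math
from typing import List, Sequence


def reconstruction_error(row: Sequence[float], reconstructed: Sequence[float]) -> float:
    """Per-row mean squared difference between input and reconstruction."""
    return sum((x - y) ** 2 for x, y in zip(row, reconstructed)) / len(row)


def fit_threshold(clean_errors: Sequence[float], quantile: float = 0.95) -> float:
    """Pick the error value at the given quantile of the clean-data errors.

    Assumption: a simple order-statistic quantile; the source does not
    specify how the comparison threshold is chosen.
    """
    ordered = sorted(clean_errors)
    idx = math.ceil(quantile * len(ordered)) - 1
    idx = max(0, min(idx, len(ordered) - 1))
    return ordered[idx]


def flag_rows(errors: Sequence[float], threshold: float) -> List[int]:
    """Return indices of rows whose error is strictly above the threshold."""
    return [i for i, e in enumerate(errors) if e > threshold]
```

In use, `fit_threshold` would be called once on errors from the clean baseline, and `flag_rows` on each unseen batch; rows that are flagged would then be candidates for the repair decoder's suggested values.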

Paper Structure

This paper contains 22 sections, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Examples of Data Errors in tabular data: anomalies, typos, and conflicts between attributes.
  • Figure 2: Data Quality Validation Framework Using GNN. Top: Training on clean data. Bottom: Validating unseen data by reconstruction error comparison.
  • Figure 3: Accuracy across different methods and two datasets with real-world data errors. (All methods have Recall=1)
  • Figure 4: Scalability Analysis: data quality validation time of our approach, when varying the data dimensionality and data size, on the New York Taxi dataset.