MechDetect: Detecting Data-Dependent Errors

Philipp Jung, Nicholas Chandler, Sebastian Jäger, Felix Biessmann

TL;DR

MechDetect presents a data-driven framework to infer how data errors are generated in tabular datasets by leveraging an error mask and three training setups (Complete, Shuffled, Excluded). Through two statistically guarded tests and non-linear binary classifiers, it can distinguish MCAR from MAR/MNAR and further separate MAR from MNAR, with strong empirical performance across 101 datasets. The approach highlights the importance of understanding error-generation mechanisms for data cleaning and downstream tasks, while recognizing practical assumptions such as the availability of clean data and an error mask. Overall, MechDetect offers a principled, scalable method to diagnose data-dependent error mechanisms in real-world tabular data pipelines.

Abstract

Data quality monitoring is a core challenge in modern information processing systems. While many approaches to detect data errors or shifts have been proposed, few studies investigate the mechanisms governing error generation. We argue that knowing how errors were generated can be key to tracing and fixing them. In this study, we build on existing work in the statistics literature on missing values and propose MechDetect, a simple algorithm to investigate error generation mechanisms. Given a tabular data set and a corresponding error mask, the algorithm estimates whether or not the errors depend on the data using machine learning models. Our work extends established approaches to detect mechanisms underlying missing values and can be readily applied to other error types, provided that an error mask is available. We demonstrate the effectiveness of MechDetect in experiments on established benchmark datasets.
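The three training setups named in the TL;DR (Complete, Shuffled, Excluded) can be pictured with a small sketch. The following is an illustration under stated assumptions, not the authors' implementation: it assumes that Complete uses all columns, Shuffled uses a permuted error mask, and Excluded drops the column that carries the errors. The function name `setup_auc_scores`, the choice of a random forest as the non-linear classifier, and the synthetic data are all hypothetical. The sketch stops at the per-setup AUC-ROC scores and does not attempt the two statistical tests, which this summary does not specify.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def setup_auc_scores(X, col, mask, seed=0, cv=5):
    """Cross-validated AUC-ROC for predicting `mask` under three setups.

    X    : 2D float array holding the clean table (assumed to be available)
    col  : index of the column whose errors `mask` marks
    mask : 1D boolean array, True where the value in `col` is erroneous
    """
    rng = np.random.default_rng(seed)
    y = mask.astype(int)
    setups = {
        # Assumption: all columns, including the one that carries the errors.
        "complete": (X, y),
        # Assumption: same features, permuted mask, so any dependency is destroyed.
        "shuffled": (X, rng.permutation(y)),
        # Assumption: the erroneous column is dropped from the features.
        "excluded": (np.delete(X, col, axis=1), y),
    }
    scores = {}
    for name, (features, target) in setups.items():
        clf = RandomForestClassifier(n_estimators=100, random_state=seed)
        scores[name] = cross_val_score(
            clf, features, target, cv=cv, scoring="roc_auc"
        )
    return scores


# Synthetic example: errors in column 0 occur when its own value is large,
# which mimics an MNAR-like dependency. Columns 1 and 2 are unrelated noise.
data_rng = np.random.default_rng(42)
X = data_rng.normal(size=(600, 3))
mask = X[:, 0] > 0.5
scores = setup_auc_scores(X, col=0, mask=mask)
for name, values in scores.items():
    print(f"{name}: mean AUC-ROC = {values.mean():.2f}")
```

In this synthetic case only the Complete setup sees the column that drives the errors, so it is the only one expected to score well above chance. If the mask instead depended on another column, both Complete and Excluded would be expected to score above chance, while Shuffled should stay near 0.5 in either case.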

Paper Structure

This paper contains 12 sections, 1 equation, 7 figures, 1 table, 1 algorithm.

Figures (7)

  • Figure 1: Illustration of error generation mechanisms as formalized for missing values in the statistics literature. Arrows between columns denote a statistical dependency; E0 and E1 are the columns of $E$, the binary error mask of the respective table. Errors in the left table are independent of the data, i.e. Missing Completely At Random (MCAR). In the Missing At Random (MAR) example, values in Quests are missing if the corresponding Hero's name starts with the letter P. Thus, there is a dependency between the error mask and the column Hero, which determines the missing values' positions. Finally, in the Missing Not At Random (MNAR) case, values in Quests are missing if they are greater than 5, meaning that $E$ depends on values in Quests.
  • Figure 2: Example of applying MechDetect to the column Quests. The leaves represent the dependency structure for the corresponding error mechanism. For MCAR, there is no dependency between $E$ and the data. For the MAR mechanism, the error distribution potentially depends on the column Hero, whereas for MNAR, $E$ additionally depends on the column Quests.
  • Figure 3: Accuracy of MechDetect classifying error mechanisms. The errors introduced are missing values, with an error rate of 0.5. We observe a mean accuracy of 89.04% at this error rate.
  • Figure 4: Mean accuracy of MechDetect as a function of the error rate. Colored areas around individual data points indicate the 95% confidence interval for the mean.
  • Figure 5: AUC-ROC scores of classifiers predicting the error mask from data. If errors are independent of the data (MCAR and Shuffled setting), classifiers perform at chance level, as expected. If errors depend on data (MAR and MNAR), the Complete and Excluded classifiers achieve above-chance performance, often with AUC-ROC scores close to 1.0.
  • ...and 2 more figures
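
The three mechanisms described in the Figure 1 caption can be reproduced on a toy table. The sketch below is illustrative only: the column names Hero and Quests and the two rules (name starts with P, value greater than 5) come from the caption, while the rows, the function names, and the MCAR error rate of 0.5 are made up for the example.

```python
import random

# Toy table using Figure 1's column names; the rows are invented for illustration.
heroes = ["Percy", "Arthur", "Pippin", "Gawain", "Paris", "Lancelot"]
quests = [3, 7, 9, 2, 6, 5]


def mcar_mask(n, rate, seed=0):
    """MCAR: error positions are drawn independently of any data value."""
    rng = random.Random(seed)
    return [rng.random() < rate for _ in range(n)]


def mar_mask(hero_column):
    """MAR: Quests is missing when the Hero name starts with 'P',
    so the mask depends on another, observed column."""
    return [name.startswith("P") for name in hero_column]


def mnar_mask(quests_column, threshold=5):
    """MNAR: Quests is missing when its own value exceeds the threshold,
    so the mask depends on the values that become erroneous."""
    return [value > threshold for value in quests_column]


e_mcar = mcar_mask(len(quests), rate=0.5)
e_mar = mar_mask(heroes)
e_mnar = mnar_mask(quests)

print("MAR mask: ", e_mar)
print("MNAR mask:", e_mnar)
```

Each mask is one column of the binary error mask $E$ for Quests. Only in the MAR and MNAR cases can the mask be predicted from the table, which is the dependency that a classifier trained on the data would pick up.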