Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables

Qixu Chen, Yeye He, Raymond Chi-Wing Wong, Weiwei Cui, Song Ge, Haidong Zhang, Dongmei Zhang, Surajit Chaudhuri

TL;DR

The paper tackles data cleaning by moving beyond manually specified per-table constraints to automatically learned Semantic-Domain Constraints (SDCs) that capture column semantics and error patterns. It introduces Auto-Test, an unsupervised framework that generates, statistically validates, and optimally selects SDCs from large table corpora, using domain-evaluation functions that span column-type annotation (CTA), embeddings, patterns, and validation functions, with LP-relaxation-based guarantees. Empirical results show that Fine-Select, a calibrated SDC subset, substantially outperforms baselines across real-world benchmarks and generalizes across table types, while remaining efficient enough for interactive use. The approach also uncovers new, correct constraints and errors not covered by existing data-cleaning benchmarks, demonstrating practical utility and complementarity to traditional constraint-based methods. Overall, Auto-Test provides a scalable, explainable pathway to democratize and augment data cleaning through learned semantic-domain constraints.

Abstract

Data cleaning is a long-standing challenge in data management. While powerful logic and statistical algorithms have been developed to detect and repair data errors in tables, existing algorithms predominantly rely on domain experts to first manually specify data-quality constraints specific to a given table, before data-cleaning algorithms can be applied. In this work, we propose a new class of data-quality constraints that we call Semantic-Domain Constraints, which can be reliably inferred and automatically applied to any table, without requiring domain experts to manually specify them on a per-table basis. We develop a principled framework to systematically learn such constraints from table corpora using large-scale statistical tests, which can further be distilled into a core set of constraints using our optimization framework, with provable quality guarantees. Extensive evaluations show that this new class of constraints can be used to both (1) directly detect errors on real tables in the wild, and (2) augment existing expert-driven data-cleaning techniques as a new class of complementary constraints. Our extensively labeled benchmark dataset with 2400 real data columns and our code are available at https://github.com/qixuchen/AutoTest to facilitate future research.

Paper Structure

This paper contains 25 sections, 3 theorems, 19 equations, 20 figures, 12 tables, and 1 algorithm.

Key Result

Theorem 1

The CSS problem is NP-hard and cannot be approximated within a factor of $(1-1/e)$, unless $NP \subseteq DTIME(n^{O(\log \log n)})$.

Figures (20)

  • Figure 1: Example data-cleaning feature for end-users in Microsoft Excel. Data quality issues in user tables are automatically detected using techniques such as [HH18, wang2019uni, chakrabarti2016data, xing2024table], and are presented as intuitive "suggestion cards" on the side-pane (right) for users to review and accept. The clean-data demo (https://drive.google.com/file/d/1kIVLVOZQfZn2Dqd2M-fblo7EwpP_O4tw/view?usp=drive_link) gives an end-to-end walkthrough of how users can leverage such automated capabilities to easily clean data (without needing to define any constraints first), while staying in full control over any suggested changes that may be applied to their data.
  • Figure 2: Real examples of table columns, each representing a distinct "semantic domain" (annotated in the column header). Each column $C_i$ has a real data error (which may be a typo or a semantically incompatible value) that is detected by a corresponding "semantic-domain constraint" $r_i$ in Table \ref{tab:rule_example}; these constraints are learned automatically by running Auto-Test.
  • Figure 3: Real examples of table columns in which false-positive error detections are produced (highlighted cells) when existing column-type detection techniques are applied directly to the task of error detection.
  • Figure 4: Visual illustration of a constraint $r_t = (P, S, c)$, where the inner ball with radius $d_{in}$ corresponds to the pre-condition $P$, and the outer ball with radius $d_{out}$ corresponds to the post-condition $S$. For a column $C = \{v_1, v_2, v_3, v_4, v_5\}$, $v_1$, $v_2$ and $v_3$ fall inside the inner ball (indicating that these values are likely in the domain of type $t$), while $v_5$ falls outside the outer ball (and is likely not in the domain of type $t$).
  • Figure 5: Architecture diagram of Auto-Test
  • ...and 15 more figures
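
The inner-ball/outer-ball picture in the Figure 4 caption can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Constraint` class, the `detect_errors` helper, the `min_fraction` threshold (the share of values that must fall inside the inner ball before the pre-condition $P$ is considered satisfied), and the toy distance values are all hypothetical names and numbers introduced here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Constraint:
    """Hypothetical container for a constraint r_t = (P, S, c)."""
    distance: Callable[[str], float]  # assumed domain-evaluation function: distance of a value to type t
    d_in: float          # inner-ball radius, used by the pre-condition P
    d_out: float         # outer-ball radius, used by the post-condition S
    min_fraction: float  # assumption: share of values that must lie in the inner ball for P to hold
    confidence: float    # c, the confidence attached to the constraint


def detect_errors(column: List[str], r: Constraint) -> List[str]:
    """Return the values flagged by r, or an empty list if P does not hold."""
    if not column:
        return []
    dists = [r.distance(v) for v in column]
    # Pre-condition P: enough values fall inside the inner ball.
    inside = sum(1 for d in dists if d <= r.d_in)
    if inside / len(column) < r.min_fraction:
        return []
    # Post-condition S: values outside the outer ball are flagged.
    return [v for v, d in zip(column, dists) if d > r.d_out]


# Toy distances mirroring the caption: v1..v3 inside the inner ball,
# v4 between the two balls, v5 outside the outer ball.
toy_dist = {"v1": 0.10, "v2": 0.20, "v3": 0.15, "v4": 0.50, "v5": 0.90}
r_t = Constraint(distance=toy_dist.__getitem__, d_in=0.3, d_out=0.7,
                 min_fraction=0.5, confidence=0.95)

flagged = detect_errors(["v1", "v2", "v3", "v4", "v5"], r_t)
print(flagged)
```

In this toy setup only `v5` is flagged; `v4` lies between the two radii, so the constraint neither uses it as evidence for the type nor reports it as an error. Keeping a gap between `d_in` and `d_out` is what lets such a constraint stay conservative about borderline values.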

Theorems & Definitions (13)

  • Example 1
  • Example 2
  • Definition 1
  • Definition 2
  • Example 3
  • Definition 3
  • Example 4
  • Example 5
  • Definition 4
  • Theorem 1
  • ...and 3 more