Auto-Test: Learning Semantic-Domain Constraints for Unsupervised Error Detection in Tables
Qixu Chen, Yeye He, Raymond Chi-Wing Wong, Weiwei Cui, Song Ge, Haidong Zhang, Dongmei Zhang, Surajit Chaudhuri
TL;DR
The paper tackles data cleaning by moving beyond manually specified per-table constraints to automatically learned Semantic-Domain Constraints (SDC) that capture column semantics and error patterns. It introduces Auto-Test, an unsupervised framework that generates, statistically validates, and optimally selects SDCs from large table corpora. Candidate constraints are built on domain-evaluation functions drawn from column-type annotation (CTA), embeddings, patterns, and validation functions, and the selection step carries LP-relaxation-based guarantees. Empirical results show that Fine-Select, a calibrated SDC subset, substantially outperforms baselines across real-world benchmarks and generalizes across table types, while remaining efficient enough for interactive use. The approach also uncovers new, correct constraints and errors not covered by existing data-cleaning benchmarks, demonstrating practical utility and complementarity to traditional constraint-based methods. Overall, Auto-Test provides a scalable, explainable pathway to democratize and augment data cleaning through learned semantic-domain constraints.
Abstract
Data cleaning is a long-standing challenge in data management. While powerful logic-based and statistical algorithms have been developed to detect and repair data errors in tables, they predominantly rely on domain experts to first manually specify data-quality constraints specific to a given table before data-cleaning algorithms can be applied. In this work, we propose a new class of data-quality constraints that we call Semantic-Domain Constraints, which can be reliably inferred and automatically applied to any table, without requiring domain experts to specify them manually on a per-table basis. We develop a principled framework to systematically learn such constraints from table corpora using large-scale statistical tests; the learned constraints can be further distilled into a core set using our optimization framework, with provable quality guarantees. Extensive evaluations show that this new class of constraints can be used both to (1) directly detect errors on real tables in the wild, and (2) augment existing expert-driven data-cleaning techniques as a new class of complementary constraints. Our extensively labeled benchmark dataset with 2400 real data columns, as well as our code, is available at https://github.com/qixuchen/AutoTest to facilitate future research.
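To make the idea of a constraint that applies to any table without per-table specification more concrete, the following is a minimal illustrative sketch and not the released Auto-Test code. It assumes a constraint can be modeled as a domain check plus a coverage threshold: the constraint only fires when most values in a column already fall inside the domain, and the remaining values are then reported as suspected errors. The class name `SemanticDomainConstraint`, the field `min_coverage`, the regex-based check, and the 0.75 threshold are all hypothetical choices made for this example.

```python
import re
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class SemanticDomainConstraint:
    """Hypothetical stand-in for a learned constraint (not the paper's API)."""
    name: str
    in_domain: Callable[[str], bool]  # assumed domain-evaluation function
    min_coverage: float               # assumed fraction needed for the constraint to apply

    def detect(self, column: List[str]) -> List[str]:
        """Return values flagged as suspected errors, or [] if the constraint does not apply."""
        if not column:
            return []
        matches = [self.in_domain(value) for value in column]
        coverage = sum(matches) / len(column)
        # Pre-condition: the column must look like it belongs to this domain.
        if coverage < self.min_coverage:
            return []
        # Values outside the domain are reported as candidates for review.
        return [value for value, ok in zip(column, matches) if not ok]


# A pattern-based domain check for ISO-style dates (illustrative only).
DATE_PATTERN = re.compile(r"\d{4}-\d{2}-\d{2}")
iso_date = SemanticDomainConstraint(
    name="iso-date",
    in_domain=lambda v: DATE_PATTERN.fullmatch(v) is not None,
    min_coverage=0.75,
)

dates = ["2024-01-05", "2024-02-11", "2024-03-19", "2024-04-02", "n/a"]
names = ["alice", "bob", "2024-01-05"]

print(iso_date.detect(dates))  # ['n/a']  (4 of 5 values match, so the constraint applies)
print(iso_date.detect(names))  # []       (only 1 of 3 values match, so it stays silent)
```

The coverage check is what lets one constraint be tried against arbitrary columns: on a column from a different domain it simply does not fire, rather than flagging every value. In the framework described above, the domain functions and thresholds are learned and statistically validated from table corpora rather than hand-picked as they are here.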
