Automatic String Data Validation with Pattern Discovery

Xinwei Lin, Jing Zhao, Peng Di, Chuan Xiao, Rui Mao, Yan Ji, Makoto Onizuka, Zishuo Ding, Weiyi Shang, Jianbin Qin

TL;DR

The paper tackles data quality issues in semi-structured string data in enterprise data pipelines by introducing a self-validate data management system that learns data patterns automatically. Pattern discovery proceeds in two steps: skeleton extraction uncovers high-level structural constraints, and fine-grained semantics extraction captures character-level validity using an entropy-based cost, guided by a domain-specific language and a generalization tree. Key contributions include the skeleton and semantics algorithms, a pattern-based distance measure, data augmentation, and incremental updates. Experiments report high precision and recall, and the system is deployed at Ant Group. The approach improves early error detection, reduces manual constraint writing, and scales to industrial data volumes, enhancing data quality and engineer productivity.

Abstract

In enterprise data pipelines, data insertions occur periodically and may impact downstream services if data quality issues are not addressed. Typically, such problems can be investigated and fixed by on-call engineers, but locating their cause and fixing the errors is often time-consuming. Therefore, automatic data validation is a better solution to defend the system and downstream services by enabling early detection of errors and providing detailed error messages for quick resolution. This paper proposes a self-validate data management system with automatic pattern discovery techniques to verify the correctness of semi-structured string data in enterprise data pipelines. Our solution extracts patterns from historical data and detects erroneous incoming data in a top-down fashion. High-level information of historical data is analyzed to discover the format skeleton of correct values. Fine-grained semantic patterns are then extracted to strike a balance between generalization and specification of the discovered pattern, thus covering as many correct values as possible while avoiding over-fitting. To tackle cold start and rapid data growth, we propose an incremental update strategy and an example generalization strategy. Experiments on large-scale industrial and public datasets demonstrate the effectiveness and efficiency of our method compared to alternative solutions. Furthermore, a case study on an industrial platform (Ant Group Inc.) with thousands of applications shows that our system captures meaningful data patterns in daily operations and helps engineers quickly identify errors.
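
The top-down idea described above (skeleton first, fine-grained constraints second) can be illustrated with a small sketch. This is not the paper's algorithm: the function names (`tokenize`, `learn_patterns`, `validate`), the `min_support` and `max_entropy` parameters, and the plain Shannon-entropy threshold are all assumptions made for illustration. They stand in for the paper's domain-specific language, generalization tree, and entropy-based cost, which are not specified in this summary.

```python
import math
import re
import string
from collections import Counter, defaultdict

# Hypothetical tokenizer: runs of ASCII letters, runs of ASCII digits,
# and every other character as a single-character delimiter token.
TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|[^A-Za-z0-9]")


def tokenize(value):
    return TOKEN_RE.findall(value)


def symbol(token):
    """Map a token to a skeleton symbol: 'A' letters, 'D' digits, else itself."""
    if token[0] in string.ascii_letters:
        return "A"
    if token[0] in string.digits:
        return "D"
    return token


def skeleton(tokens):
    return "".join(symbol(t) for t in tokens)


def entropy(counts):
    """Shannon entropy (bits) of the empirical distribution in a Counter."""
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def learn_patterns(history, min_support=0.1, max_entropy=1.5):
    """Step 1: group values by skeleton and drop rare skeletons.
    Step 2: per letter/digit position, keep an enumerated value set when
    entropy is low, otherwise generalize to 'any token of that class'.
    Both thresholds are illustrative assumptions."""
    groups = defaultdict(list)
    for value in history:
        tokens = tokenize(value)
        groups[skeleton(tokens)].append(tokens)

    patterns = {}
    for skel, rows in groups.items():
        if len(rows) / len(history) < min_support:
            continue
        constraints = []
        for i, tok in enumerate(rows[0]):
            if symbol(tok) not in ("A", "D"):
                constraints.append(None)  # delimiter is fixed by the skeleton
                continue
            counts = Counter(row[i] for row in rows)
            low = entropy(counts) <= max_entropy
            constraints.append(set(counts) if low else None)
        patterns[skel] = constraints
    return patterns


def validate(value, patterns):
    """Check the skeleton first, then the fine-grained constraints."""
    tokens = tokenize(value)
    skel = skeleton(tokens)
    if skel not in patterns:
        return False, f"unknown skeleton {skel!r}"
    for i, (tok, allowed) in enumerate(zip(tokens, patterns[skel])):
        if allowed is not None and tok not in allowed:
            return False, f"token {i} ({tok!r}) not among learned values"
    return True, "ok"


if __name__ == "__main__":
    history = ["ORD-2024-0001", "ORD-2024-0002", "ORD-2023-0417",
               "ORD-2024-1234", "ORD-2023-9876", "ORD-2024-5555"]
    patterns = learn_patterns(history)
    for candidate in ["ORD-2024-7777", "ORD/2024/7777", "XYZ-2024-0001"]:
        print(candidate, validate(candidate, patterns))
```

In this sketch the returned message plays the role of the detailed error message mentioned in the abstract. The `max_entropy` threshold controls the trade-off between generalization and specification: a low-entropy position (here the `ORD` prefix and the year) is kept as an enumerated set, so an unseen year would be rejected, while the high-entropy serial number is generalized to any digit run.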

Paper Structure (24 sections, 10 equations, 15 figures, 1 table, 5 algorithms)

Figures (15)

  • Figure 1: Data pipelines with data validation
  • Figure 2: Complex nested data examples
  • Figure 3: Self-validate data management system framework
  • Figure 4: Base type and structural definition
  • Figure 5: Syntax and semantics of pattern
  • ...and 10 more figures