PAT: Pattern-Perceptive Transformer for Error Detection in Relational Databases
Jian Fu, Xixian Han, Xiaolong Wan, Wenjian Wang
TL;DR
PAT is an attribute-aware Pattern-Perceptive Transformer for error detection in relational databases. Its Quasi-Tokens Arrangement (QTA) tokenizer turns each cell into a fixed number $N$ of data tokens, each embedded in dimension $D$, and interleaves them with learnable attribute-pattern tokens so that the model jointly learns data features shared across attributes and patterns specific to each attribute. The method achieves superior F1 on diverse real-world and synthetic datasets and, with the compact QTA configuration, uses fewer parameters and FLOPs than standard multi-detector or large LLM baselines. It also provides attention-based explanations that help locate, and potentially repair, erroneous cells, supporting data cleaning and assessment workflows.
Abstract
Error detection in relational databases is critical for maintaining data quality and is fundamental to tasks such as data cleaning and assessment. Most current error detection studies use a multi-detector approach to handle heterogeneous attributes, which incurs high costs. In addition, their data preprocessing strategies fail to exploit the variable-length nature of data sequences, which reduces accuracy. In this paper, we propose an attribute-wise PAttern-perceptive Transformer (PAT) framework for error detection in relational databases. First, PAT introduces a pattern module that captures attribute-specific data distributions through embeddings learned during model training. Second, the Quasi-Tokens Arrangement (QTA) tokenizer divides each cell sequence according to its length and word types and generates word-adaptive data tokens, while offering a compact hyperparameter setting for efficiency. By interleaving data tokens with the attribute-specific pattern tokens, PAT jointly learns data features shared across attributes and pattern features that are distinctive to each attribute. Third, PAT visualizes the attention map to interpret its error detection mechanism. Extensive experiments show that PAT achieves higher F1 scores than state-of-the-art data error detection methods. Moreover, PAT significantly reduces model parameters and FLOPs when the compact QTA tokenizer is applied.
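To make the token layout described above concrete, the following is a minimal sketch and not the authors' implementation. It assumes that "word types" can be approximated by runs of letters, digits, and other characters, that overflow segments are merged into the last slot, and that interleaving means strict alternation of one data token and one pattern token. The function names (`split_by_char_type`, `to_fixed_segments`, `interleave`), the `<pad>` placeholder, and the example pattern labels are all hypothetical.

```python
import itertools


def split_by_char_type(cell: str) -> list[str]:
    """Split a cell value into runs of letters, digits, and other characters.

    Assumption: character classes stand in for the abstract's "word types".
    """
    def char_type(ch: str) -> str:
        if ch.isalpha():
            return "alpha"
        if ch.isdigit():
            return "digit"
        return "other"

    return ["".join(group) for _, group in itertools.groupby(cell, key=char_type)]


def to_fixed_segments(cell: str, n_tokens: int, pad: str = "<pad>") -> list[str]:
    """Arrange a variable-length cell into exactly n_tokens segments.

    Assumption: extra runs are merged into the last segment, and short
    cells are right-padded with a placeholder.
    """
    runs = split_by_char_type(cell)
    if len(runs) > n_tokens:
        runs = runs[: n_tokens - 1] + ["".join(runs[n_tokens - 1:])]
    return runs + [pad] * (n_tokens - len(runs))


def interleave(data_tokens: list[str], pattern_tokens: list[str]) -> list[str]:
    """Alternate data tokens with attribute-specific pattern tokens.

    Assumption: one pattern token follows each data token.
    """
    if len(data_tokens) != len(pattern_tokens):
        raise ValueError("expected one pattern token per data token")
    mixed = []
    for data_tok, pattern_tok in zip(data_tokens, pattern_tokens):
        mixed.extend([data_tok, pattern_tok])
    return mixed


# Hypothetical usage: a "flight_id" attribute with N = 4 token slots.
segments = to_fixed_segments("AB-123", n_tokens=4)
patterns = [f"flight_id_p{i}" for i in range(4)]  # stand-ins for learnable tokens
sequence = interleave(segments, patterns)
```

In a real model each string here would be replaced by a $D$-dimensional embedding, with the pattern tokens drawn from a learnable table indexed by attribute. The string form is only meant to show the ordering of the resulting sequence.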
