CAE: Character-Level Autoencoder for Non-Semantic Relational Data Grouping
Veera V S Bhargav Nunna, Shinae Kang, Zheyuan Zhou, Virginia Wang, Sucharitha Boinapally, Michael Foley
TL;DR
The paper tackles the challenge of grouping semantically identical columns in large enterprise datasets containing non-semantic data by introducing a Character-Level Autoencoder (CAE). The method uses ASCII-based 1-of-$m$ character encoding and two-stage processing (Character-Level Encoding plus Auto-Encoder) with a fixed input size, learning dense embeddings from column patterns rather than semantics. Empirical results on WikiTableQuestions show that the Alternative Convolutional CAE achieves Top-1 76.19% and Top-5 85.71% accuracy, substantially outperforming traditional dictionary-based and semantic NLP baselines. This approach offers scalable, robust data profiling and schema understanding for large-scale data warehouses, mitigating issues related to formatting, abbreviations, and out-of-vocabulary terms, and it lays groundwork for integration with data catalogs and governance platforms.
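To make the character-level encoding step concrete, the following is a minimal sketch of ASCII-based 1-of-$m$ encoding with a fixed input size. The alphabet size (`ALPHABET_SIZE = 128`), the fixed length (`FIXED_LENGTH = 16`), and the function name `encode_value` are assumptions chosen for illustration; the paper's actual alphabet, input size, and padding scheme may differ.

```python
from typing import List

# Assumptions for illustration only: a 128-symbol ASCII alphabet (m = 128)
# and a fixed input length of 16 characters per cell value.
ALPHABET_SIZE = 128
FIXED_LENGTH = 16


def encode_value(text: str, length: int = FIXED_LENGTH, m: int = ALPHABET_SIZE) -> List[List[int]]:
    """Return a length x m matrix of 0/1 rows, one row per character position."""
    matrix = [[0] * m for _ in range(length)]
    for pos, ch in enumerate(text[:length]):  # values longer than `length` are truncated
        code = ord(ch)
        if code < m:  # characters outside the alphabet stay as all-zero rows
            matrix[pos][code] = 1
    return matrix  # positions past the end of `text` remain all-zero (padding)


encoded = encode_value("192.168.0.1")
print(len(encoded), len(encoded[0]))  # 16 128
print(encoded[0].index(1))            # 49, the ASCII code of "1"
```

Because every value maps to a matrix of the same shape regardless of its content, the downstream autoencoder can use a fixed input size, and no vocabulary has to be built or grown as new values appear.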
Abstract
Enterprise relational databases increasingly contain vast amounts of non-semantic data (IP addresses, product identifiers, encoded keys, and timestamps) that challenge traditional semantic analysis. This paper introduces a novel Character-Level Autoencoder (CAE) approach that automatically identifies and groups semantically identical columns in non-semantic relational datasets by detecting column similarities based on data patterns and structures. Unlike conventional Natural Language Processing (NLP) models, which are limited by semantic interpretability and out-of-vocabulary tokens, our approach operates at the character level with a fixed dictionary size. The CAE architecture encodes text representations of non-semantic relational table columns and extracts high-dimensional feature embeddings for data grouping. Keeping the dictionary size fixed significantly reduces both memory requirements and training time, enabling scalable processing of large-scale industrial data lakes and warehouses. Experimental evaluation demonstrates substantial performance gains: our CAE approach achieves 80.95% accuracy on top-5 column matching tasks across relational datasets, substantially outperforming traditional NLP approaches such as Bag of Words (47.62%). These results demonstrate the effectiveness of the approach for identifying and clustering identical columns in relational datasets. This work bridges the gap between theoretical advances in character-level neural architectures and practical enterprise data management challenges, providing an automated solution for schema understanding and data profiling of non-semantic industrial datasets at scale.
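The top-k column matching metric reported above can be illustrated with a short sketch. It assumes that each column has already been reduced to an embedding vector and that candidates are ranked by cosine similarity; the similarity measure, the function names, and the three-dimensional toy vectors below are all hypothetical and are not taken from the paper or its results.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_matches(query: Sequence[float], candidates: Dict[str, Sequence[float]], k: int) -> List[str]:
    """Names of the k candidate columns most similar to the query embedding."""
    ranked = sorted(candidates, key=lambda name: cosine(query, candidates[name]), reverse=True)
    return ranked[:k]


def top_k_accuracy(queries: Dict[str, Sequence[float]],
                   candidates: Dict[str, Sequence[float]],
                   truth: Dict[str, str], k: int) -> float:
    """Fraction of query columns whose true match appears among the top k candidates."""
    hits = sum(1 for name, vec in queries.items()
               if truth[name] in top_k_matches(vec, candidates, k))
    return hits / len(queries)


# Hypothetical toy embeddings, made up purely to exercise the functions.
candidates = {
    "ip_address": [0.9, 0.1, 0.0],
    "order_id": [0.1, 0.9, 0.1],
    "created_at": [0.0, 0.2, 0.9],
}
queries = {"client_ip": [0.8, 0.2, 0.1], "sku": [0.1, 0.3, 0.8]}
truth = {"client_ip": "ip_address", "sku": "order_id"}

print(top_k_accuracy(queries, candidates, truth, k=1))  # 0.5
print(top_k_accuracy(queries, candidates, truth, k=2))  # 1.0
```

In this toy setup the `sku` query ranks its true match second, so it counts as a miss at k=1 and a hit at k=2, which is why top-5 figures are generally at least as high as top-1 figures for the same model.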
