CAE: Character-Level Autoencoder for Non-Semantic Relational Data Grouping

Veera V S Bhargav Nunna, Shinae Kang, Zheyuan Zhou, Virginia Wang, Sucharitha Boinapally, Michael Foley

TL;DR

The paper tackles the challenge of grouping semantically identical columns in large enterprise datasets containing non-semantic data by introducing a Character-Level Autoencoder (CAE). The method uses ASCII-based 1-of-$m$ character encoding and two-stage processing (Character-Level Encoding plus Auto-Encoder) with a fixed input size, learning dense embeddings from column patterns rather than semantics. Empirical results on WikiTableQuestions show that the Alternative Convolutional CAE achieves Top-1 76.19% and Top-5 85.71% accuracy, substantially outperforming traditional dictionary-based and semantic NLP baselines. This approach offers scalable, robust data profiling and schema understanding for large-scale data warehouses, mitigating issues related to formatting, abbreviations, and out-of-vocabulary terms, and it lays groundwork for integration with data catalogs and governance platforms.
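As a rough illustration of the ASCII-based 1-of-$m$ encoding with a fixed input size, the sketch below maps each character of a cell value to a one-hot row. The alphabet (printable ASCII, so $m = 95$), the `max_len` of 16, and the function name `one_hot_encode` are assumptions made for this example and are not taken from the paper.

```python
# Assumed dictionary: printable ASCII characters (m = 95). The paper's actual
# dictionary and length cutoff may differ.
ALPHABET = [chr(c) for c in range(32, 127)]
CHAR_INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def one_hot_encode(text, max_len=16):
    """Return a max_len x m matrix (list of rows), one row per character position."""
    m = len(ALPHABET)
    matrix = [[0] * m for _ in range(max_len)]
    # Text longer than max_len is truncated; shorter text leaves trailing zero rows.
    for pos, ch in enumerate(text[:max_len]):
        idx = CHAR_INDEX.get(ch)
        if idx is not None:  # characters outside the dictionary stay all-zero
            matrix[pos][idx] = 1
    return matrix


encoded = one_hot_encode("192.168.0.1")
```

Because every input maps to a matrix of the same shape regardless of its content, no vocabulary has to grow as new identifiers or keys appear, which is the property the fixed dictionary relies on.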

Abstract

Enterprise relational databases increasingly contain vast amounts of non-semantic data - IP addresses, product identifiers, encoded keys, and timestamps - that challenge traditional semantic analysis. This paper introduces a novel Character-Level Autoencoder (CAE) approach that automatically identifies and groups semantically identical columns in non-semantic relational datasets by detecting column similarities based on data patterns and structures. Unlike conventional Natural Language Processing (NLP) models that struggle with limitations in semantic interpretability and out-of-vocabulary tokens, our approach operates at the character level with fixed dictionary constraints, enabling scalable processing of large-scale data lakes and warehouses. The CAE architecture encodes text representations of non-semantic relational table columns and extracts high-dimensional feature embeddings for data grouping. By maintaining a fixed dictionary size, our method significantly reduces both memory requirements and training time, enabling efficient processing of large-scale industrial data environments. Experimental evaluation demonstrates substantial performance gains: our CAE approach achieved 80.95% accuracy in top 5 column matching tasks across relational datasets, substantially outperforming traditional NLP approaches such as Bag of Words (47.62%). These results demonstrate its effectiveness for identifying and clustering identical columns in relational datasets. This work bridges the gap between theoretical advances in character-level neural architectures and practical enterprise data management challenges, providing an automated solution for schema understanding and data profiling of non-semantic industrial datasets at scale.
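The top-5 figure quoted above is a ranking metric. A common way to compute such a score, shown here as a sketch under assumptions rather than as the paper's evaluation code, is to rank candidate columns for each query column and count a hit when a true match appears among the first $k$. The function name `top_k_accuracy` and the toy column names are hypothetical.

```python
def top_k_accuracy(ranked_candidates, true_matches, k):
    """Fraction of query columns with at least one true match in their first k candidates."""
    hits = 0
    for query, ranking in ranked_candidates.items():
        if any(candidate in true_matches[query] for candidate in ranking[:k]):
            hits += 1
    return hits / len(ranked_candidates)


# Toy rankings, made up for illustration (not data from the paper).
ranked = {
    "t1.award": ["t2.award", "t3.year", "t2.result"],
    "t1.result": ["t3.year", "t2.result", "t2.award"],
    "t1.year": ["t2.award", "t2.result", "t3.year"],
}
truth = {
    "t1.award": {"t2.award"},
    "t1.result": {"t2.result"},
    "t1.year": {"t3.year"},
}

top1 = top_k_accuracy(ranked, truth, 1)
top2 = top_k_accuracy(ranked, truth, 2)
```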

Paper Structure

This paper contains 21 sections, 3 equations, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: Character-Level Auto-Encoder framework: (1) Character-Level Encoding converts table column text into sparse matrices; (2) Auto-Encoder compresses these matrices into dense latent vectors and reconstructs the original encoding; (3) Latent vectors enable column grouping via cosine similarity measurement.
  • Figure 2: Two character-level encoding (CLE) approaches for column vector assembly: (a) Concatenated encoding sequentially joins entry vectors up to a length limit, and (b) Alternative CLE averages entry vectors to create a smoothed representation.
  • Figure 3: Sample WikiTableQuestions tables that should be grouped together on shared Award, Category, or Result columns.
  • Figure 4: Text length distributions across dataset columns: linear scale (left) and logarithmic scale (right), with the spike at 250 corresponding to the selected character cutoff threshold.
  • Figure 5: Reconstruction quality improvement over training epochs: Input encoding matrix (left) and their reconstructions (right). As the number of training epochs increases, the model progressively captures salient features and underlying patterns, resulting in reconstructions that more closely resemble the original encoded inputs.
  • ...and 2 more figures
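
The two column-assembly variants in Figure 2 and the cosine-similarity grouping in Figure 1 can be sketched as follows. This is one possible reading of the captions, not the authors' implementation: the auto-encoder stage is omitted, so similarity is measured directly on the raw encodings instead of on latent vectors, and the alphabet, the length limit of 12, and all function names are assumptions made for this example.

```python
import math

ALPHABET = [chr(c) for c in range(32, 127)]  # assumed dictionary: printable ASCII
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def encode(text, max_len):
    """Flattened one-hot encoding, truncated or zero-padded to max_len characters."""
    vec = [0.0] * (max_len * len(ALPHABET))
    for pos, ch in enumerate(text[:max_len]):
        if ch in INDEX:
            vec[pos * len(ALPHABET) + INDEX[ch]] = 1.0
    return vec


def concatenated_cle(entries, max_len):
    """(a) Join the entries in order and encode the result up to the length limit."""
    return encode("".join(entries), max_len)


def alternative_cle(entries, max_len):
    """(b) Encode each entry separately and average the vectors element-wise."""
    vectors = [encode(entry, max_len) for entry in entries]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Toy columns, made up for illustration.
ips_a = ["10.0.0.1", "10.0.0.2", "10.0.1.7"]
ips_b = ["10.0.0.9", "10.0.2.3", "10.0.0.4"]
names = ["Alice", "Bob", "Carol"]

sim_same = cosine_similarity(alternative_cle(ips_a, 12), alternative_cle(ips_b, 12))
sim_diff = cosine_similarity(alternative_cle(ips_a, 12), alternative_cle(names, 12))
```

In this toy setup the two IP-address columns share characters at most positions and score well above the IP-versus-name pair, which shares none. In the framework described in Figure 1, the same comparison would be applied to the dense latent vectors produced by the encoder.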