
OmniMatch: Effective Self-Supervised Any-Join Discovery in Tabular Data Repositories

Christos Koutras, Jiani Zhang, Xiao Qin, Chuan Lei, Vasileios Ioannidis, Christos Faloutsos, George Karypis, Asterios Katsifodimos

TL;DR

Compared to the state-of-the-art matching and discovery methods, OmniMatch exhibits up to 14% higher effectiveness in F1 score and AUC without relying on metadata or user-provided thresholds for each similarity metric.

Abstract

How can we discover join relationships among columns of tabular data in a data repository? Can this be done effectively when metadata is missing? Traditional column matching approaches rely mainly on similarity measures based on exact value overlaps, and therefore miss important semantics or fail to handle noise in the data. At the same time, recent dataset discovery methods that focus on deep table representation learning do not take into consideration the rich set of column similarity signals found in prior matching and discovery methods. Finally, existing methods depend heavily on user-provided similarity thresholds, hindering their deployability in real-world settings. In this paper, we propose OmniMatch, a novel join discovery technique that detects equi-joins and fuzzy-joins between columns by combining column-pair similarity measures with Graph Neural Networks (GNNs). OmniMatch's GNN can capture column relatedness by leveraging graph transitivity, significantly improving the recall of join discovery tasks. At the same time, OmniMatch also increases precision by augmenting its training data with negative column join examples through an automated negative example generation process. Most importantly, compared to the state-of-the-art matching and discovery methods, OmniMatch exhibits up to 14% higher effectiveness in F1 score and AUC without relying on metadata or user-provided thresholds for each similarity metric.

Paper Structure (24 sections, 5 equations, 10 figures, 4 tables)


Figures (10)

  • Figure 1: OmniMatch outperforms the state-of-the-art column matching and representation methods in terms of best F1 and Precision-Recall AUC scores achieved when tested upon real-world join benchmarks on open data repositories (see the experiments section). Best viewed in color.
  • Figure 2: OmniMatch at work: traditional similarity-based methods vs. OmniMatch. If the similarity threshold is set to 0.3 for Jaccard Similarity (JS) or to 0.5 for Set Containment (SC), traditional methods will miss the match between columns Cntry and CNTR. Choosing these thresholds is very hard in practice because they are use-case- and dataset-dependent. OmniMatch's GNN-based method is able to discover joins using graph neighborhood information, despite the low similarity between columns, without user-provided thresholds. Best viewed in color.
  • Figure 3: OmniMatch overview: (b) positive and negative join examples are generated in a self-supervised manner based on the original data repository shown in (a). For each positive and negative join pair, OmniMatch computes a set of similarity signals (c) and then constructs a similarity graph (d), which represents the most prominent column relationships among training data. The similarity graph and the join examples are the basis for producing column representations through a GNN and training a join prediction model, as shown in (e). For discovering joins, we repeat steps (c) and (d) for the original tabular datasets in the repository and use the trained model to infer joins among their columns. Best viewed in color.
  • Figure 4: Using Jaccard similarity on infrequent tokens and embedding similarity on frequent tokens for capturing fuzzy-joins.
  • Figure 5: For training, OmniMatch fabricates pairs of joinable datasets (T1) from each original one in the repository to build a similarity graph (T2) for training the join prediction model (T3). For inference, OmniMatch constructs the similarity graph of the columns stemming from the original datasets (I1) and uses the trained model for inference on it (I2).
  • ...and 5 more figures
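
The threshold problem described in the Figure 2 caption can be illustrated with a minimal Python sketch. It assumes the common set-based definitions of Jaccard Similarity and Set Containment over distinct column values, which may differ in detail from the paper's. The column contents below are hypothetical and are not taken from the paper; only the column names Cntry and CNTR and the thresholds 0.3 and 0.5 come from the caption.

```python
def jaccard_similarity(a, b):
    """|A ∩ B| / |A ∪ B| over the distinct values of two columns."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def set_containment(a, b):
    """|A ∩ B| / |A|: fraction of A's distinct values that also occur in B."""
    a, b = set(a), set(b)
    if not a:
        return 0.0
    return len(a & b) / len(a)


# Hypothetical column values (assumption): the same countries, mostly
# written in different formats, so exact value overlap is small.
cntry = ["Greece", "Germany", "France", "Italy", "Spain"]
cntr = ["Greece", "GER", "FRA", "ITA", "ESP", "Netherlands"]

js = jaccard_similarity(cntry, cntr)
sc = set_containment(cntry, cntr)

print(f"JS = {js:.2f}, SC = {sc:.2f}")
print("matched with JS >= 0.3:", js >= 0.3)
print("matched with SC >= 0.5:", sc >= 0.5)
```

With these values both scores fall below the thresholds, so a purely threshold-based matcher would reject the pair even though the columns describe the same entities. This is the kind of case where, according to the caption, graph neighborhood information helps.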

Theorems & Definitions (2)

  • Definition 2.1: Equi-join
  • Definition 2.2: Fuzzy-join
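
The formal statements of Definitions 2.1 and 2.2 are not reproduced in this listing. As a rough illustration only, the sketch below uses an informal reading: an equi-join pairs values that are exactly equal, while a fuzzy-join pairs values that are similar enough under some string similarity. The choice of `difflib.SequenceMatcher`, the 0.8 threshold, the function names, and the sample values are all assumptions made for this example, not the paper's method.

```python
from difflib import SequenceMatcher


def equi_join(left, right):
    """Pair values that are exactly equal."""
    right_values = set(right)
    return [(v, v) for v in left if v in right_values]


def fuzzy_join(left, right, threshold=0.8):
    """Pair values whose string similarity ratio reaches the threshold.

    The similarity function and threshold are illustrative assumptions.
    """
    pairs = []
    for l in left:
        for r in right:
            if SequenceMatcher(None, l, r).ratio() >= threshold:
                pairs.append((l, r))
    return pairs


# Hypothetical values: "Netherland" is a noisy variant of "Netherlands".
left_col = ["Netherlands", "Germany", "France"]
right_col = ["Netherland", "Germany", "Italy"]

exact_pairs = equi_join(left_col, right_col)
fuzzy_pairs = fuzzy_join(left_col, right_col)

print("equi-join pairs: ", exact_pairs)
print("fuzzy-join pairs:", fuzzy_pairs)
```

The equi-join recovers only the exactly matching value, while the fuzzy-join also pairs the noisy variant. Note that this toy fuzzy-join still depends on a hand-picked threshold, which is the dependency the abstract says OmniMatch avoids.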