Snoopy: Effective and Efficient Semantic Join Discovery via Proxy Columns

Yuxiang Guo, Yuren Mao, Zhonghao Hu, Lu Chen, Yunjun Gao

TL;DR

This work tackles semantic join discovery for dataset search: given a query column, identify the top-k repository columns that are semantically joinable with it. It introduces Snoopy, a proxy-column-based framework that derives column embeddings from column-to-proxy-column relations via an approximate-graph-matching (AGM) based projection, and learns the proxy-column matrices with a rank-aware contrastive learning paradigm. Embeddings of repository columns are pre-computed offline and indexed with HNSW, so the online stage only has to encode the query column and retrieve its nearest neighbors. Snoopy outperforms state-of-the-art column-level methods by 16% in Recall@25 and 10% in NDCG@25, while being at least 5 orders of magnitude faster than cell-level solutions and 3.5x faster than existing column-level methods. Overall, Snoopy bridges the semantics-joinability-gap, overcomes the size limit and permutation sensitivity of existing column embeddings, and enables scalable, high-quality semantic join discovery for large table repositories.
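The reported gains are stated in Recall@25 and NDCG@25. As a reading aid, the sketch below shows one common way to compute these two metrics for a ranked list of retrieved columns. The exact definitions used in the paper's evaluation are not given in this summary, so the normalization in `recall_at_k`, the use of joinability scores as graded gains in `ndcg_at_k`, and all column ids and scores are assumptions made for illustration.

```python
import math

def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant columns that appear in the top-k retrieved list.

    Assumption: normalized by min(k, number of relevant columns).
    """
    if not relevant:
        return 0.0
    hits = sum(1 for col in retrieved[:k] if col in relevant)
    return hits / min(k, len(relevant))

def ndcg_at_k(retrieved, gains, k):
    """Normalized discounted cumulative gain over the top-k retrieved list.

    Assumption: `gains` maps a column id to a graded relevance value,
    for example its exact joinability with the query column.
    """
    def dcg(scores):
        return sum(s / math.log2(rank + 2) for rank, s in enumerate(scores))

    actual = dcg([gains.get(col, 0.0) for col in retrieved[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0

# Hypothetical joinability scores of four repository columns for one query.
gains = {"c1": 0.9, "c2": 0.7, "c3": 0.4, "c4": 0.1}
exact_top2 = {"c1", "c2"}
retrieved = ["c1", "c3", "c2", "c4"]  # ranking returned by some method

recall = recall_at_k(retrieved, exact_top2, k=2)
ndcg = ndcg_at_k(retrieved, gains, k=2)
perfect = ndcg_at_k(["c1", "c2", "c3", "c4"], gains, k=2)
```

Recall@k only asks whether the right columns were retrieved, while NDCG@k also rewards placing the more joinable columns earlier in the ranking.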

Abstract

Semantic join discovery, which aims to find columns in a table repository with high semantic joinability to a query column, is crucial for dataset discovery. Existing methods can be divided into two categories: cell-level methods and column-level methods. However, neither of them ensures both effectiveness and efficiency simultaneously. Cell-level methods, which compute the joinability by counting cell matches between columns, enjoy ideal effectiveness but suffer from poor efficiency. In contrast, column-level methods, which determine joinability only by computing the similarity of column embeddings, enjoy proper efficiency but suffer from poor effectiveness due to three issues in their column embeddings: (i) semantics-joinability-gap, (ii) size limit, and (iii) permutation sensitivity. To address these issues, this paper proposes to compute column embeddings via proxy columns; furthermore, a novel column-level semantic join discovery framework, Snoopy, is presented, leveraging proxy-column-based embeddings to bridge effectiveness and efficiency. Specifically, the proposed column embeddings are derived from the implicit column-to-proxy-column relationships, which are captured by the lightweight approximate-graph-matching-based column projection. To acquire good proxy columns for guiding the column projection, we introduce a rank-aware contrastive learning paradigm. Extensive experiments on four real-world datasets demonstrate that Snoopy outperforms state-of-the-art column-level methods by 16% in Recall@25 and 10% in NDCG@25, and achieves superior efficiency, being at least 5 orders of magnitude faster than cell-level solutions and 3.5x faster than existing column-level methods.
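To make the idea of proxy-column-based embeddings concrete, here is a minimal sketch in which a column is represented by how well it matches each of a fixed set of proxy columns. It is not the paper's projection: the scoring rule in `project_onto_proxy` (best-matching cell per proxy vector, then an average) is an assumed stand-in for the AGM-based column projection, the proxy columns are random rather than learned, and all names and sizes are hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def project_onto_proxy(column_cells, proxy_column):
    """Assumed column-to-proxy-column score (not the paper's exact formula).

    For each proxy vector, keep the similarity of its best-matching cell,
    then average these values over the proxy vectors.
    """
    best = [max(cosine(cell, p) for cell in column_cells) for p in proxy_column]
    return sum(best) / len(best)

def embed_column(column_cells, proxy_columns):
    """Column embedding with one coordinate per proxy column."""
    return [project_onto_proxy(column_cells, proxy) for proxy in proxy_columns]

def random_vectors(rng, count, dim):
    """Stand-in for cell embeddings or proxy vectors."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
dim = 8
# 5 hypothetical proxy columns with 4 vectors each (learned in the real system).
proxy_columns = [random_vectors(rng, 4, dim) for _ in range(5)]
short_column = random_vectors(rng, 3, dim)   # a column with 3 cells
long_column = random_vectors(rng, 50, dim)   # a column with 50 cells

short_embedding = embed_column(short_column, proxy_columns)
long_embedding = embed_column(long_column, proxy_columns)
```

Both columns end up with an embedding of length 5, regardless of how many cells they contain. Because the proxy columns are fixed after training, the embeddings of repository columns can be computed once offline, and only the query column needs to be projected at query time.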

Paper Structure

This paper contains 32 sections, 1 theorem, 11 equations, 9 figures, 11 tables, 1 algorithm.

Key Result

Theorem 1

The proposed column representation $f(C)$ is size-unlimited and permutation-invariant.
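The intuition behind this statement can be sketched for a simplified form of the representation. The form of $f$ below is an assumption chosen for illustration, not the paper's definition or proof. Suppose a column $C = \{c_1, \dots, c_n\}$ of cell embeddings is scored against each proxy column $P_i = \{p_{i,1}, \dots, p_{i,m}\}$ by

$$
g(C, P_i) = \frac{1}{m} \sum_{j=1}^{m} \max_{1 \le t \le n} \operatorname{sim}(c_t, p_{i,j}),
\qquad
f(C) = \big(g(C, P_1), \dots, g(C, P_l)\big).
$$

For any permutation $\sigma$ of $\{1, \dots, n\}$, the maximum is taken over the same set of values, so

$$
\max_{1 \le t \le n} \operatorname{sim}(c_{\sigma(t)}, p_{i,j}) = \max_{1 \le t \le n} \operatorname{sim}(c_t, p_{i,j}),
$$

and therefore $f$ is unchanged when the cells of $C$ are reordered. The dimension of $f(C)$ is $l$, the number of proxy columns, for every $n \ge 1$, so no truncation of long columns is needed.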

Figures (9)

  • Figure 1: An example of semantic join discovery. The cells in columns $C_Q$, $C_1$ and $C_2$ with the same grayscale are matched.
  • Figure 2: (a) The column-to-column joinabilities that exact solutions should assess. (b) The existing column-level solutions encode columns independently without column-to-column interactions. (c) Our proposed proxy columns capture column-to-proxy-column joinabilities and are used to derive column embeddings. For all repository columns $C_1, \dots, C_x$, c2pc values can be pre-computed offline. During the online stage, only the computation $C_Q - P_i$ is needed to generate the embedding for the query column $C_Q$.
  • Figure 3: Overview of Snoopy. "Column" is abbreviated as "Col".
  • Figure 4: An illustration of rank-aware contrastive learning.
  • Figure 5: Visualization of column embeddings of Opendata learned by Deepjoin and our proposed Snoopy.
  • ...and 4 more figures

Theorems & Definitions (12)

  • Example 1
  • Definition 1
  • Definition 2
  • Definition 3
  • Definition 4
  • Definition 5
  • Definition 6
  • Example 2
  • Definition 7
  • Theorem 1
  • ...and 2 more