Snoopy: Effective and Efficient Semantic Join Discovery via Proxy Columns
Yuxiang Guo, Yuren Mao, Zhonghao Hu, Lu Chen, Yunjun Gao
TL;DR
This work tackles semantic join discovery for dataset search: identifying the top-k repository columns that semantically match a query column. It introduces Snoopy, a proxy-column-based framework that derives column embeddings from column-to-proxy-column relations via an approximate-graph-matching (AGM) based projection, and learns the proxy-column matrices with a rank-aware contrastive learning paradigm. Offline, Snoopy pre-computes repository column embeddings and indexes them with HNSW, enabling fast online query encoding and retrieval. It significantly outperforms state-of-the-art column-level methods (e.g., Recall@25 +16% and NDCG@25 +10%) while being at least 5 orders of magnitude faster than cell-level solutions and 3.5x faster than existing column-level methods. Overall, Snoopy bridges the semantics-joinability-gap, overcomes the size limit and permutation sensitivity of existing column embeddings, and enables scalable, high-quality semantic join discovery for large table repositories.
Abstract
Semantic join discovery, which aims to find columns in a table repository that have high semantic joinability with a query column, is crucial for dataset discovery. Existing methods fall into two categories: cell-level methods and column-level methods. However, neither achieves both effectiveness and efficiency. Cell-level methods, which compute joinability by counting cell matches between columns, achieve ideal effectiveness but suffer from poor efficiency. In contrast, column-level methods, which determine joinability solely from the similarity of column embeddings, are efficient but less effective because of three issues in their column embeddings: (i) the semantics-joinability-gap, (ii) the size limit, and (iii) permutation sensitivity. To address these issues, this paper proposes computing column embeddings via proxy columns and presents Snoopy, a novel column-level semantic join discovery framework that leverages proxy-column-based embeddings to bridge effectiveness and efficiency. Specifically, the proposed column embeddings are derived from implicit column-to-proxy-column relationships, which are captured by a lightweight approximate-graph-matching-based column projection. To acquire good proxy columns for guiding the column projection, we introduce a rank-aware contrastive learning paradigm. Extensive experiments on four real-world datasets demonstrate that Snoopy outperforms SOTA column-level methods by 16% in Recall@25 and 10% in NDCG@25, and achieves superior efficiency: it is at least 5 orders of magnitude faster than cell-level solutions and 3.5x faster than existing column-level methods.
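To make the cell-level versus column-level trade-off concrete, the following is a minimal Python sketch, not Snoopy's implementation. It assumes cells are already represented as small numeric vectors. The function names (`cell_level_joinability`, `proxy_embedding`, `top_k`), the threshold `tau`, and the max-then-average pooling are hypothetical choices made for illustration. The pooling is a simple stand-in for the AGM-based projection, and the brute-force ranking is a stand-in for HNSW retrieval.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def cell_level_joinability(query_cells, target_cells, tau=0.9):
    """Cell-level view: fraction of query cells that have at least one target
    cell with cosine similarity >= tau. Cost grows with |query| * |target|."""
    if not query_cells:
        return 0.0
    matched = sum(
        1 for q in query_cells if any(cosine(q, t) >= tau for t in target_cells)
    )
    return matched / len(query_cells)


def proxy_embedding(cells, proxy_columns):
    """Column-level view (illustrative assumption, not the paper's projection):
    one coordinate per proxy column, computed as the average over proxy cells of
    the best cosine match among the column's cells. Assumes non-empty inputs.
    Uses every cell and ignores cell order, so it has no size limit and is
    permutation-invariant."""
    embedding = []
    for proxy in proxy_columns:
        scores = [max(cosine(p, c) for c in cells) for p in proxy]
        embedding.append(sum(scores) / len(scores))
    return embedding


def top_k(query_emb, repo_embs, k):
    """Brute-force top-k by cosine similarity; returns repository indices.
    An approximate index such as HNSW would replace this scan at scale."""
    ranked = sorted(
        range(len(repo_embs)),
        key=lambda i: cosine(query_emb, repo_embs[i]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    # Hypothetical 2-D cell vectors and two hand-picked proxy columns.
    proxies = [[(1.0, 0.0)], [(0.0, 1.0)]]
    repository = [[(0.0, 1.0)], [(1.0, 0.0), (1.0, 0.0)]]
    query = [(1.0, 0.0)]

    repo_embs = [proxy_embedding(col, proxies) for col in repository]
    print(top_k(proxy_embedding(query, proxies), repo_embs, k=1))
    print([cell_level_joinability(query, col) for col in repository])
```

In this sketch the repository embeddings can be computed once offline, so an online query only needs one `proxy_embedding` call plus a nearest-neighbor lookup, whereas the cell-level score must compare the query against every cell of every candidate column.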
