Evaluating Joinable Column Discovery Approaches for Context-Aware Search
Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Haritha Ananthakrishnan, Oktie Hassanzadeh, Horst Samulowitz, Kavitha Srinivas
TL;DR
The paper addresses joinable column discovery in heterogeneous enterprise data by formalizing a context-aware task and evaluating six joinability criteria across seven benchmarks. It compares equi-join and semantic methods and introduces TOPJoin, an ensemble that integrates syntactic, metadata, and value signals via TOPSIS. The findings show that metadata semantics and value semantics matter most in data lakes, that size-based criteria matter more in relational databases, and that ensemble approaches consistently outperform single-criterion methods. The study offers guidelines for selecting a method based on dataset characteristics and releases reproducible artifacts, including code and annotated OpenData resources.
Abstract
Joinable column discovery is a critical challenge in automating enterprise data analysis. While existing approaches focus on syntactic overlap and semantic similarity, there remains limited understanding of which methods perform best for different data characteristics and how multiple criteria influence discovery effectiveness. We present a comprehensive experimental evaluation of joinable column discovery methods across diverse scenarios. Our study compares syntactic and semantic techniques on seven benchmarks covering relational databases and data lakes. We analyze six key criteria (unique values, intersection size, join size, reverse join size, value semantics, and metadata semantics) and examine how combining them through ensemble ranking affects performance. Our analysis reveals differences in method behavior across data contexts and highlights the benefits of integrating multiple criteria for robust join discovery. We provide empirical evidence on when each criterion matters, compare pre-trained embedding models for semantic joins, and offer practical guidelines for selecting suitable methods based on dataset characteristics. Our findings show that metadata and value semantics are crucial for data lakes, size-based criteria play a stronger role in relational databases, and ensemble approaches consistently outperform single-criterion methods.
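To make the idea of combining criteria through ensemble ranking concrete, the following is a minimal sketch of textbook TOPSIS applied to three size-based signals. It is not the authors' TOPJoin implementation. The helper names (`size_criteria`, `topsis`), the toy columns, the equal weights, the treatment of every criterion as a benefit, and the specific definitions used here (distinct values in the candidate, distinct shared values, and equi-join row count) are all assumptions made for illustration; the paper's exact definitions may differ.

```python
import math
from collections import Counter


def size_criteria(query, candidate):
    """Three overlap signals for one (query, candidate) column pair.

    Assumed definitions, for illustration only.
    """
    q_counts, c_counts = Counter(query), Counter(candidate)
    shared = q_counts.keys() & c_counts.keys()
    return [
        len(c_counts),                                   # distinct values in candidate
        len(shared),                                     # distinct values in common
        sum(q_counts[v] * c_counts[v] for v in shared),  # rows an equi-join would emit
    ]


def topsis(matrix, weights):
    """Return one closeness score in [0, 1] per row; higher is better.

    Every criterion is treated as a benefit (larger is better).
    """
    n = len(weights)
    # Vector-normalize each column; guard against an all-zero column.
    norms = [math.sqrt(sum(row[j] ** 2 for row in matrix)) or 1.0 for j in range(n)]
    scaled = [[weights[j] * row[j] / norms[j] for j in range(n)] for row in matrix]
    # Ideal best and ideal worst points, taken column by column.
    best = [max(row[j] for row in scaled) for j in range(n)]
    worst = [min(row[j] for row in scaled) for j in range(n)]
    scores = []
    for row in scaled:
        d_best = math.dist(row, best)
        d_worst = math.dist(row, worst)
        total = d_best + d_worst
        scores.append(d_worst / total if total else 0.0)
    return scores


# Toy query column and candidate columns (hypothetical data).
query = ["us", "uk", "de", "fr", "us"]
candidates = {
    "country_code": ["us", "uk", "de", "fr", "jp"],
    "currency": ["usd", "gbp", "eur"],
    "region": ["us", "us", "uk"],
}

names = list(candidates)
matrix = [size_criteria(query, candidates[name]) for name in names]
scores = topsis(matrix, weights=[1 / 3, 1 / 3, 1 / 3])
ranking = sorted(zip(names, scores), key=lambda pair: pair[1], reverse=True)

for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

In this toy setup `country_code` ranks first because it is best on all three signals, while `currency` ranks last because it shares no values with the query. Semantic signals such as value or metadata embedding similarity could be appended as extra columns of the decision matrix, with the weights adjusted to reflect the data context.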
