Evaluating Joinable Column Discovery Approaches for Context-Aware Search

Harsha Kokel, Aamod Khatiwada, Tejaswini Pedapati, Haritha Ananthakrishnan, Oktie Hassanzadeh, Horst Samulowitz, Kavitha Srinivas

TL;DR

The paper tackles joinable column discovery in heterogeneous enterprise data by formalizing a context-aware task and evaluating six joinability criteria across seven benchmarks. It compares equi-join and semantic methods and introduces TOPJoin, an ensemble that integrates syntactic, metadata, and value signals via TOPSIS. Key findings show metadata semantics and value semantics are especially impactful in data lakes, while size-based criteria matter more in relational databases, and ensemble approaches consistently outperform single-criterion methods. The study provides actionable guidelines for method selection based on dataset characteristics and releases reproducible artifacts, including code and annotated OpenData resources. Overall, the work advances understanding of when and how to combine criteria for robust, scalable context-aware join discovery.

Abstract

Joinable Column Discovery is a critical challenge in automating enterprise data analysis. While existing approaches focus on syntactic overlap and semantic similarity, there remains limited understanding of which methods perform best for different data characteristics and how multiple criteria influence discovery effectiveness. We present a comprehensive experimental evaluation of joinable column discovery methods across diverse scenarios. Our study compares syntactic and semantic techniques on seven benchmarks covering relational databases and data lakes. We analyze six key criteria -- unique values, intersection size, join size, reverse join size, value semantics, and metadata semantics -- and examine how combining them through ensemble ranking affects performance. Our analysis reveals differences in method behavior across data contexts and highlights the benefits of integrating multiple criteria for robust join discovery. We provide empirical evidence on when each criterion matters, compare pre-trained embedding models for semantic joins, and offer practical guidelines for selecting suitable methods based on dataset characteristics. Our findings show that metadata and value semantics are crucial for data lakes, size-based criteria play a stronger role in relational databases, and ensemble approaches consistently outperform single-criterion methods.
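As a rough illustration of how several criterion scores can be merged into one ranking, the sketch below applies textbook TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution), the method the TL;DR names as the basis of TOPJoin. It is not the authors' implementation. The candidate column names, the score values, the equal weights, and the choice to treat all six criteria as "higher is better" are assumptions made only for this example.

```python
import math


def topsis_rank(scores, weights=None):
    """Rank candidates with textbook TOPSIS.

    scores:  dict mapping candidate name -> list of criterion values,
             where every criterion is assumed to be "higher is better".
    weights: optional list of criterion weights (defaults to equal weights).
    Returns a list of (name, closeness) pairs sorted best-first.
    """
    names = list(scores)
    rows = [scores[name] for name in names]
    n_crit = len(rows[0])
    if weights is None:
        weights = [1.0 / n_crit] * n_crit

    # Step 1: vector-normalize each criterion column (guard against all-zero columns).
    norms = [math.sqrt(sum(row[j] ** 2 for row in rows)) or 1.0 for j in range(n_crit)]

    # Step 2: apply the criterion weights.
    weighted = [[weights[j] * row[j] / norms[j] for j in range(n_crit)] for row in rows]

    # Step 3: ideal and anti-ideal points (column-wise best and worst values).
    ideal = [max(col) for col in zip(*weighted)]
    anti_ideal = [min(col) for col in zip(*weighted)]

    # Step 4: relative closeness to the ideal point, in [0, 1].
    closeness = {}
    for name, row in zip(names, weighted):
        d_pos = math.dist(row, ideal)
        d_neg = math.dist(row, anti_ideal)
        total = d_pos + d_neg
        closeness[name] = d_neg / total if total else 0.0

    return sorted(closeness.items(), key=lambda item: item[1], reverse=True)


# Hypothetical, already-scaled scores for three candidate columns, ordered as:
# unique values, intersection size, join size, reverse join size,
# value semantics, metadata semantics.
candidates = {
    "orders.customer_id": [0.9, 0.8, 0.7, 0.9, 0.8, 0.9],
    "users.email":        [0.5, 0.4, 0.6, 0.3, 0.7, 0.2],
    "misc.notes":         [0.1, 0.1, 0.2, 0.1, 0.3, 0.1],
}

ranking = topsis_rank(candidates)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

A candidate that is best on every criterion coincides with the ideal point and gets closeness 1, while one that is worst on every criterion gets closeness 0. Candidates with mixed strengths fall in between, which is where the choice of weights would start to matter.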

Paper Structure

This paper contains 33 sections, 6 figures, 6 tables.

Figures (6)

  • Figure 1: Snapshot of the annotation tool that was developed to collect human annotations for context-aware joinable columns in OpenData. The source metadata on the left belongs to \ref{fig:topjoin_main}(a), the query table in \ref{ex:context_aware_join}, and the target metadata on the right belongs to \ref{fig:topjoin_main}(b). Annotators had access to the complete metadata and the first 1000 rows of both tables. Additionally, they were able to preview the first 1500 rows of the resulting joined table.
  • Figure 2: Architecture of the ensemble-based join discovery approach.
  • Figure 3: Effectiveness of different methods in Relational DB Benchmarks.
  • Figure 4: Effectiveness of different methods in Data lake Benchmarks.
  • Figure 5: Effectiveness of different methods in Fuzzy Join Benchmarks.
  • ...and 1 more figure

Theorems & Definitions (2)

  • definition 1
  • definition 2