FREYJA: Efficient Join Discovery in Data Lakes

Marc Maynou, Sergi Nadal, Raquel Panadero, Javier Flores, Oscar Romero, Anna Queralt

TL;DR

FREYJA tackles join discovery in data lakes with a semantic-aware join quality metric that blends a multiset Jaccard component with a cardinality-proportion signal. It predicts this metric with a lightweight model trained on data profiles, which allows scalable, generalizable inference without heavy embeddings or fine-tuning. The approach reaches state-of-the-art effectiveness on synthetic benchmarks while drastically reducing preprocessing time and hardware requirements, for efficiency gains of orders of magnitude. The framework is also transparent and explainable, with potential use in hybrid pipelines and data augmentation tasks in real-world data-lake environments.

Abstract

Data lakes are massive repositories of raw and heterogeneous data, designed to meet the requirements of modern data storage. Nonetheless, this same philosophy increases the complexity of performing discovery tasks to find relevant data for subsequent processing. As a response to these growing challenges, we present FREYJA, a modern data discovery system capable of effectively exploring data lakes, aimed at finding candidates to perform joins and increase the number of attributes for downstream tasks. More precisely, we want to compute rankings that sort potential joins by their relevance. Modern mechanisms apply advanced table representation learning (TRL) techniques to yield accurate joins. Yet, this incurs high computational costs when dealing with large volumes of data. In contrast to the state of the art, we adopt a novel notion of join quality tailored to data lakes, which leverages syntactic measurements while achieving accuracy comparable to that of TRL approaches. To obtain this metric in a scalable manner, we train a general-purpose predictive model. Predictions are based not on the large-scale datasets themselves but on data profiles, succinct representations that capture the underlying characteristics of the data. Our experiments show that our system, FREYJA, matches the results of the state of the art whilst reducing execution times by several orders of magnitude.
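
To make the two syntactic signals named in the summary concrete, here is a minimal Python sketch of a multiset Jaccard and a cardinality proportion for two columns. The definitions are assumptions, not taken from the paper: the multiset Jaccard is computed as the sum of per-value minimum counts over the sum of per-value maximum counts, and the cardinality proportion $K$ as the ratio of the smaller to the larger distinct-value count. The function names are hypothetical, and the paper's quality function $Q(A,B,L)$, which combines the two signals, is not reproduced here.

```python
from collections import Counter


def multiset_jaccard(a, b):
    """Multiset Jaccard under an ASSUMED min/max-count definition.

    Sum of per-value minimum counts divided by the sum of per-value
    maximum counts. The paper's exact definition may differ.
    """
    counts_a, counts_b = Counter(a), Counter(b)
    values = counts_a.keys() | counts_b.keys()
    # Counter returns 0 for missing keys, so values in only one column
    # contribute 0 to the intersection and their count to the union.
    intersection = sum(min(counts_a[v], counts_b[v]) for v in values)
    union = sum(max(counts_a[v], counts_b[v]) for v in values)
    return intersection / union if union else 0.0


def cardinality_proportion(a, b):
    """ASSUMED form of K: smaller distinct count over larger distinct count."""
    distinct_a, distinct_b = len(set(a)), len(set(b))
    larger = max(distinct_a, distinct_b)
    return min(distinct_a, distinct_b) / larger if larger else 0.0


# Hypothetical example columns (country codes).
col_a = ["ES", "ES", "FR", "DE"]
col_b = ["ES", "FR", "FR", "IT", "PT"]

j = multiset_jaccard(col_a, col_b)
k = cardinality_proportion(col_a, col_b)
print(f"multiset Jaccard = {j:.3f}, cardinality proportion = {k:.2f}")
```

Both values lie in $[0, 1]$. Under these assumed definitions, a pair of columns scores high on both only when the columns share most of their values and have a similar number of distinct values.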

Paper Structure

This paper contains 20 sections, 5 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 2: Average Kendall's $\tau$ coefficient between embedding-based quality rankings and set-overlap metrics
  • Figure 3: Distribution of the cardinality proportion ($K$) for semantic vs. syntactic joins on the Freyja$_{\text{BM}}$ benchmark
  • Figure 4: Distribution of ground truth labels over $\mathcal{J}$ and $K$
  • Figure 5: Distribution of $Q(A,B,L)$ for the case of $L=4$
  • Figure 6: Resulting CDFs that minimize the Wasserstein distance over the EDF of $\mathcal{J}$ (left) and $K$ (right) for $Q(A,B,4)$
  • ...and 10 more figures